Professor Norm Matloff, Dept. of Computer Science, University of California at Davis, Davis, CA 95616
Drew Schmidt, Oak Ridge National Labs
Operations are high-level, such as sorting, selecting, searching and prefix scan. From these basic building blocks, a myriad of parallel applications can be coded.
For example, consider Rth's rthrank(), which serves as a parallel version of R's rank(). It first calls Thrust's sort_by_key() function, and then uses Thrust's scatter() function to permute the results in a way that produces the proper ranks.
We are slowly adding to the package. Requests are welcome. Contributions even more welcome.
The two main functions for calling C/C++ code from R are .C() and .Call(). Although the latter is more complex, it tends to produce faster code; moreover, there is an excellent tool for it, Rcpp, which makes development easier than with .C(). So, we are currently transitioning all functions to .Call().
I'm assuming you have R, of course, and the gcc compiler. If you have a CUDA-capable NVIDIA GPU, I assume you have the nvcc compiler and CUDA development package. (All the above software tools are free.)
For the OpenMP case without nvcc, you'll need to download Thrust, from here.
In any case, be sure to use at least version 1.5 of Thrust.
Note: Sorry for the Linux/Mac-centric treatment here. If you are familiar with compiling and linking issues, it will be simple for you to convert to Windows.
Also, sorry that none of this is automated yet.
Normally, C/C++ code to be linked to R is built by running R CMD SHLIB. This can still be done with Rth, but with some adjustments, described here.
Below are the steps, say for rthsort.cpp. You will need to adjust some file paths for your setting.
One must inform Thrust which backend to use, and tell gcc to enable OpenMP. This can be done by setting flags before running R CMD SHLIB:
export CPLUS_INCLUDE_PATH=/home/matloff/R/Rcpp/include
export PKG_CPPFLAGS="-I/home/matloff/Thrust -fopenmp \
   -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -g"
export PKG_LIBS="-lgomp"
R CMD SHLIB rthsort.cpp
In this case, there was a subdirectory thrust within /home/matloff/Thrust, with the .h header files in that subdirectory, and the Rcpp package was in /home/matloff/R/Rcpp. Adjust for your Thrust and R package locations.
This produces rthsort.so, ready to use from R.
In the first stage, one runs a straight nvcc compile (since I've used .cpp suffixes throughout, you need to tell nvcc these are really CUDA files, via -x cu):
export CPLUS_INCLUDE_PATH=/home/matloff/R/Rcpp/include
nvcc -x cu -c rthsort.cpp -DGPU -Xcompiler "-fpic" -I/usr/include/R
This produces rthsort.o, which you input to R CMD SHLIB. For that purpose, you must first indicate where the CUDA libraries are:
export PKG_LIBS="-L/usr/local/cuda/lib64 -lcudart"
R CMD SHLIB rthsort.o -o rthsort.so
Again, this produces rthsort.so.
Each Rth routine has its own directory here, such as the one for the sort routine. There is a file named Usage in each one, showing usage and a small example, in some cases along with timing comparisons.
Copy the .so and .R files to the directory from which you will use them.
Note that in the OpenMP case, you can set the number of threads in an environment variable, namely OMP_NUM_THREADS.
Absent that variable, OpenMP will likely run with the full thread capacity of your machine, which will probably be a reasonable value to use.
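For example, to cap the OpenMP backend at four threads for anything launched from the current shell (4 here is just an illustrative value):

```shell
# limit OpenMP-backed Rth routines to 4 threads in this session
export OMP_NUM_THREADS=4
```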
You can use the routines in the Rth package directly, without writing your own code. But if you have experience writing in C++, or at least C, you are encouraged to write your own Rth routines, and the code here has been written to serve as examples which will facilitate that.
At first, don't get distracted by all the "housekeeping" code. Most of this is just constructors to set up needed vectors. The heart of the code for each routine here is described in the HowItWorks file in the given directory.
For instance, here is code from the Kendall's Tau routine:
thrust::counting_iterator<int> seqa(0);
thrust::counting_iterator<int> seqb = seqa + n - 1;
floublevec dx(x,x+n);
floublevec dy(y,y+n);
intvec tmp(n-1);
thrust::transform(seqa,seqb,tmp.begin(),calcgti(dx,dy,n));
The first five lines set up vectors. The sixth line is the one that does the actual computation. You do have to learn how to set up the vectors, but this is easy.
The one C++/Thrust construct new to many programmers will be functors, which are simply callable versions of C structs. Reading an example or two will be enough for you to write your own functors.
See my open source book on parallel programming for details on writing Thrust code. The Thrust chapter can be read without referring to any other material in the book.
Older GPUs can handle only single-precision numbers, and even the newer ones may have speed issues with double-precision. R, on the other hand, does not accommodate single-precision code well. Accordingly, the code uses typedef to choose double or float, based on whether the GPU compiler flag is set.
You can of course view Thrust as a black box. But if you wish to learn its innards, I recommend writing little test programs, compiling them for a multicore backend, and then stepping through the code with a debugging tool. One advantage of Thrust's consisting only of "include" files is that the debugger will automatically show source lines.