Rth: Parallel R through Thrust

Professor Norm Matloff
Dept. of Computer Science
University of California at Davis
Davis, CA 95616
matloff@cs.ucdavis.edu

Authors:
Norm Matloff, UC Davis
Drew Schmidt, University of Tennessee

What are Thrust and Rth?
History and current status of Rth
Platform requirements
Downloading and installing
Example of use
Current Rth package routines
Planned Rth package routines and enhancements
Building the Rth routines manually
Setting the number of threads
Writing your own Rth code
Performance
OpenMP and TBB
Known issues

What are Thrust and Rth?

Thrust is a powerful tool for parallel programming. Rth consists of R library functions that enable one to achieve parallelism through Thrust, without needing to know Thrust or C++.

Here are few details:

Thrust is a C++ package for parallel processing. It was originally designed as a high-level approach to GPU programming, but actually can be compiled to multiple backends--GPUs (CUDA) and multicore (OpenMP, TBB). This ability to have the same parallel code runnable on both GPU and multicore systems is extremely valuable. (You need not have a GPU to run the multicore backends.)
R programmers can use Rth without knowing C++, Thrust, OpenMP, TBB or CUDA!
But those who do know C++/Thrust can easily develop their own apps using the Rth approach.

Thrust operations are high-level, such as sorting, selecting, searching and prefix scan. From these basic building blocks, a myriad of parallel applications can be coded.

For example, consider Rth's rthorder(), which (in one of its options) serves as a parallel version of R's rank(). It first calls Thrust's sort_by_key() function, and then uses Thrust's scatter() function to permute the results in a way that produces the proper ranks.

Pretty Good Parallelism: The Pretty Good Privacy package (PGP) gave the world encryption which, though not optimal, was simple to use and readily available. With Rth, the idea is to have PGP mean "pretty good parallelism"--not necessarily optimal, but easy to develop and with the key virtue that the same code gives reasonably good performance on both GPUs and multicore.

One aspect of this to consider is that in the GPU case Thrust will choose the grid configuration parameters for you, with hopefully pretty good values.

History and current status of Rth:

I (Norm) began the Rth project in 2012, presenting a talk on it at useR! 2012. I was delighted when Drew, who is now of pbdR fame, offered to join the project.

However, I then postponed further work on the project until I reached the Rth chapter my book, Parallel Programming for Data Science (whose draft of the first half you can download here ). We resumed work on Rth in May 2014.

See below for a list of currently available functions Rth. More are coming. User-written Rth apps would be most welcome!

Platform requirements:

Any system with the gcc compiler should work, though testing has been only on Linux and Macs so far. To obtain gcc for Windows, install Rtools, as well as TBB. If you have a GPU, you'll need nvcc.

TBB is sometimes considerably faster than OpenMP. If you wish to use TBB as the backend will need to install TBB separately; see below.

Downloading and installing:

If you will be compiling to GPU backends, see below. For OpenMP/TBB, go to this GitHub site.

Click on Download ZIP.

Unzip the package.

Decide where you want to put the package. I use ~/R below, but since you may want to have 2 or 3 versions of the package, for the various backends, you may wish to think about this a bit.

OMP backend:

From the directory for which Rth-master is a subdirectory, type

$ R CMD INSTALL -l ~/R Rth-master

Done!

TBB backend:

Install TBB from here. Below, I'm assuming I've installed so that (for instance) tbbvars.sh, tbb.h and libtbb.so are in the subdirectories bin, include/tbb and lib/tbb of /usr/local, respectively.

From the directory for which Rth-master is a subdirectory, type

$ export LD_LIBRARY_PATH=/usr/local/lib/tbb
$ R CMD INSTALL -l ~/R Rth-master --configure-args="--with-backend=TBB \
   --with-tbb-incdir=/usr/local/include \
   --with-tbb-libdir=/usr/local/lib"

On a Mac, the library path is set slightly differently:

$ export DYLD_LIBRARY_PATH=/usr/local/lib/tbb

Done! Note, though, that whenever you run the package, the library path must be set as above.

GPU:

At present, the above automatically sets up for the OpenMP and TBB backends only, with the GPU case still under construction. For now, compile by hand, using the instructions below.

Number of threads in multicore settings:

Many R routines in the Rth package allow the user to set the number of threads uses on multicore backends. Some points should be mentioned regarding this.

First, for TBB, the designers of that library do not recommend manually setting the number of threads. In any case, the user can only control the maximum number of threads, not the actual number.

Other than asking one of the R routines in Rth to set this number, one can do it via the environment variable TBB_NUM_THREADS, as with the OpenMP case below.

In the OpenMP case, you can set the number of threads in an environment variable in your shell, e.g.

export OMP_NUM_THREADS=4

Or, set it in a file .Renviron in your home or current directory, e.g. with a line

OMP_NUM_THREADS=2

Note that somehow this variable seems to be set when R first starts up. Thus for instance using R's own environment-setting function, Sys.setenv(), seems NOT to work in this case.

Error messages:

If you get the message function 'dataptr' not provided by package 'Rcpp' you may have forgotten to run

> library(Rcpp)

This is done automatically by the full Rth package, but if you compiled a routine yourself, this is essential. You also may have an out-of-date version of Rcpp.

With large objects you may encounter "Error: long vectors not supported yet." This appears to be related to the fact that R is currently in transition to fully implementing long vectors (and a possible issue with Rcpp).

Example of usage::

> library(Rth)
Loading required package: Rcpp
> x <- runif(25)
> x
 [1] 0.90189818 0.68357514 0.93200351 0.41806736 0.40033254 0.09879424
 [7] 0.70001364 0.01025429 0.30682519 0.74398691 0.04592790 0.57226260
[13] 0.66428642 0.14953737 0.30014257 0.92142903 0.99587218 0.16254603
[19] 0.36737230 0.46898850 0.76138804 0.67405064 0.15926002 0.19043531
[25] 0.81125042
> rthsort(x)
 [1] 0.01025429 0.04592790 0.09879424 0.14953737 0.15926002 0.16254603
 [7] 0.19043531 0.30014257 0.30682519 0.36737230 0.40033254 0.41806736
[13] 0.46898850 0.57226260 0.66428642 0.67405064 0.68357514 0.70001364
[19] 0.74398691 0.76138804 0.81125042 0.90189818 0.92142903 0.93200351
[25] 0.99587218

What is currently available in Rth:

We are slowly adding to the package. Requests are welcome. Contributions even more welcome!

Current routines:

rthcolsums(), (like R's colSums())

rthdist(), distances between rows in a matrix

rthgini(), Gini coefficient

rthhist(), histogram

rthkendall(), Kendall's tau

rthorder(), replacement for R's order() and rank()

rthpdist(), distances between rows of 2 matrices

rthpearson(), Pearson't product-moment correlation

rthrnorm(), normal random number generator

rthrunif(), uniform random number generator

rthsort(), sort routine

rthtable(), contingency table, similar to R's table()

Planned Rth package routines and enhancements:

rthapply(), like R's apply(), user-supplied (C) function

rthgraph(), graph algorithms based on the Thrust graph package

rthkdens(), kernel-based density estimation

rthma(), moving average

rthmerge(), merge 2 sorted vectors

set operations

Incorporate GPU case into autoconfigure build procedure.

Increase speed on multicore backends by using Thrust's retagging feature.

Building the Rth routines manually:

The current package does not automatically build for the GPU case. For the latter, use the instructions below. For completeness, instructions for hand compilation for the OpenMP case are included as well.

Normally, C/C++ code to be linked to R is built by running R CMD SHLIB. This can still be done with Rth, but with some adjustments, described here.

Below are the steps, say for rthsort.cpp. You will need to adjust some file paths for your setting. The examples are for Linux, but Windows should be similar as long as you have gcc.

OpenMP backend:

One needs to inform Thrust which backend we want, and one must tell gcc to include OpenMP. This can be done by setting flags before running R CMD SHLIB:
```
   export CPLUS_INCLUDE_PATH=/home/matloff/R/Rcpp/include
   export PKG_CPPFLAGS="-I/home/matloff/Thrust -fopenmp \
     -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -g" 
   export PKG_LIBS="-lgomp"
   R CMD SHLIB rthsort.cpp
   
```
In this case, there was a subdirectory thrust within /home/matloff/Thrust, with the .h header files in that subdirectory. Also, the Rcpp package was in /home/matloff/R/Rcpp Adjust for your Thrust and R package locations.

This produces rthsort.so, ready to use from R.

TBB backend:

Same as for OpenMP above, with modifications as seen earlier in the TBB build:

   export CPLUS_INCLUDE_PATH=/home/matloff/R/Rcpp/include
   export PKG_CPPFLAGS="-I/home/matloff/Thrust -I/home/matloff/MyInclude/ \
      -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB -g"
   export PKG_LIBS="-L/home/matloff/MyLib/tbb -ltbb"
   R CMD SHLIB rthsortW.cpp

CUDA backend:
In the first stage, one runs a straight nvcc compile (since I've used .cpp suffixes throughout, you need to tell nvcc these are really CUDA files, via -x cu):
```
   export CPLUS_INCLUDE_PATH=/home/matloff/R/Rcpp/include
   nvcc -x cu -c rthsort.cpp -Xcompiler "-fpic" -I/usr/include/R
   
```
This produces rthsort.o, which you input to R CMD SHLIB. For that purpose, you must first indicate where the CUDA files are:
```
   export PKG_LIBS="-L/usr/local/cuda/lib64 -lcudart"
   R CMD SHLIB rthsort.o -o rthsort.so
   
```
Again, this produces rthsort.so.

Execution of hand-compiled Rth code:

Continuing the rthsort example, say for the GPU case:

Before starting R:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

From within R:

library(Rcpp)
dyn.load("rthsort.so")
x <- runif(1000)
.Call("rthsort_double",x)

Writing your own Rth code:

You can use the routines in the Rth package directly, without writing your own code. But if you have experience writing in C++, or at least C, you are encouraged to write your own Rth routines, and the code here has been written to serve as examples which will facilitate that.

At first, don't get distracted by all the "housekeeping" code. Most of this is just constructors to set up needed vectors.

For instance, here is code from the Kendall's Tau routine:

   thrust::counting_iterator seqa(0);
   thrust::counting_iterator seqb =  seqa + n - 1;
   floublevec dx(x,x+n);
   floublevec dy(y,y+n);
   intvec tmp(n-1);
   thrust::transform(seqa,seqb,tmp.begin(),calcgti(dx,dy,n));

The first five lines set up vectors. The sixth line is the one that does the actual computation. You do have to learn how to set up the vectors, but this is easy.

The one C++/Thrust construct new to many programmers will be functors, which are simply callable versions of C structs. Reading an example or two will be enough for you to write your own functors.

See my open source book on parallel programming for details on writing Thrust code. The Thrust chapter can be read without referring to any other material in the book. Also, see the examples developed by the Thrust team.

Older GPUs can handle only single-precision numbers, and even the newer ones may have speed issues with double-precision. R, on the other hand, does not accommodate single-precision code well. Accordingly, the code uses typedef to choose double or float, based on whether the GPU compiler flag is set.

You can of course view Thrust as a black box. But if you wish to learn its innards, I recommend writing little test programs, compiling them for a multicore backend, and then stepping through the code with a debugging tool. One advantage of Thrust's consisting only of "include" files is that the debugger will automatically show source lines.

The standard compiler on Macs is clang, which does not support OpenMP. Though you could also install gcc, your easiest route is to use TBB; see below.

Performance:

The code here represents a first cut at efficient code, and improvements will be made over time. Suggestions welcome!

Currently the best speedups come from rthdist(), rthhist(), rthorder(), rthpdist(), rthrank(), rthsort() and rthtable().

For many of the problems here, parallel operations will be beneficial only for very large cases, as the overhead of copying from Thrust host to Thrust device is substantial in smaller problems. In multicore (OpenMP/TBB) settings this copying is unnecssary; a suggestion for how to avoid it is here, and the Rth routines could be tweaked accordingly.

Some of the Rcpp code might be inefficient. Also, Rcpp11 may bring an improvement.

It is often the case that TBB code runs faster than OpenMP. You may wish to give both a try for any particular app.

Memory size in GPUs tends to be limited. If you have a problem that is too large for GPUs, you may wish to modify the Rth algorithm by breaking the data into chunks.

In some cases, the Rth routines yield a superlinear speedup over their R counterparts. This would seem to be an algorithm issue, rather than a computational one per se.

OpenMP and TBB:

You need not know the following in order to use Rth, but it is useful background information.

OpenMP is a very widely used open source framework for programming on multicore systems. It consists of pragmas that a C/C++/Fortran programmer can insert into code in those languages, basically instructions along the lines of "do the following section in parallel."

Threading Building Blocks (TBB), is another system for parallel programming, developed by Intel. It can often produce faster code than OpenMP. However, TBB is considerably more difficult to prograam in, though in the Thrust context the programmer does not get directly involved in TBB, which just serves as a possible backend.

Known issues:

Like R itself, Rth comes with no warranty of any kind. The package has been tested and worked well in those tests. If you find issues upon further testing, please send a bug repor to one of the authors.

Apparently due to a complex problem in the interaction of two or more of R, Rcpp, TBB and Thrust, rthtable() and rthpdist() currently do not work under TBB. There seem to be similar problems in CUDA, for rthcolsums(), rthkendall(), rthrnorm() and rthrunif().

The random number generators rthrnorm() and rthrunif() use the underlying Thrust RNG system, on which we have not done thorough testing.

Rth: Parallel R through Thrust

Contents:

Error messages: