Parallel k-Nearest Neighbor Estimators

The code in Smoother.R finds nonparametric estimates of regression and density functions, using the k-Nearest Neighbor method. The number of variables can be general, not just 1 or 2. Since large amounts of computation might be involved, it optionally allows parallel computing of the estimate.

The description will focus on the regression case. A regression function is a conditional mean of one variable (the response variable) given values of other variables (the predictor variables). We have sample data on both response and predictors, and want to predict future data points in which only the predictor variables have known values. We estimate the conditional mean to do so.

As a simple example, suppose we are predicting people's weights (response) from their heights and ages (predictors). Say we have a person of height 70 inches and age 32 but unknown weight. We would use our data to estimate the mean weight of all people in the population who have that height and age, and use that estimate as our predicted weight for this new person. That estimate would consist of the mean weight in our sample among the k people who are nearest that height and age.

Euclidean distance (the square root of the sum of squared coordinate differences) is used in defining nearness. The user may wish to center and scale the predictor data first, using R's scale() function, so that distances put all variables on an equal footing, each one then having mean 0 and variance 1.
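For instance, using a made-up matrix in the spirit of the height/age/weight example above (a sketch only, not part of Smoother.R):

> xy <- cbind(runif(5,60,78), runif(5,20,70), runif(5,120,220))  # toy (height, age, weight) data
> xy[,1:2] <- scale(xy[,1:2])  # each predictor column now has mean 0 and variance 1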

Here is an example. My matrix x, randomly generated, is

> x
     [,1] [,2] [,3]
[1,]    7   37   36
[2,]   21    3   24
[3,]    2   18   42
[4,]   49   35   31
[5,]   34   28   17

Here the first 2 columns are predictors, and the third is the response. Let's refer to them as U, V and W.

I called the smoother (it "smooths" by averaging the response in neighborhoods):

> estm <- smoothz(x,knnreg,2)
> estm
[1] 29.5 29.5 30.0 26.5 27.5

Here I said I wanted regression estimation, using 2 nearest neighbors. Let's look at the last number, for instance. It says that our estimate of the population mean of W, among all entities (people, cars, whatever) for which U = 34 and V = 28, is 27.5. Let's check it:

The two closest points to data point 5, in predictor (U,V) space, are those in rows 4 and 2. So, we estimate the regression function at row 5 to be (31+24)/2 = 27.5.

Note that in finding nearest neighbors of a point, the point itself is not counted. Thus a data point is not being used to predict itself, which guards somewhat against overfitting.
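We can verify this with a few lines of base R, which also illustrate the self-exclusion just mentioned (this is just a hand check, not the code in Smoother.R):

> d <- sqrt(colSums((t(x[,1:2]) - x[5,1:2])^2))  # distances to point 5 in (U,V) space
> d[5] <- Inf                                    # a point is not its own neighbor
> nbrs <- order(d)[1:2]                          # rows 4 and 2
> mean(x[nbrs,3])                                # (31 + 24)/2 = 27.5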

Now, let's try predicting some new data, for which we have U and V data but not W. Our new data, again made up, is

> newx
     [,1] [,2]
[1,]   28   18
[2,]   10   15

So, we have 2 new data points, one per row. Here are our predictions for their W values:

> smoothzpred(newx,x[,1:2],estm)
[1] 27.5 30.0

So, we predict the first point's W to be 27.5.
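These values are consistent with the natural approach of assigning each new point the already-computed estimate at its single nearest training point. Here is a hand check along those lines, using FNN's get.knnx(); I'm not claiming this is exactly how smoothzpred is implemented internally:

> library(FNN)
> nearest <- get.knnx(x[,1:2], newx, k=1)$nn.index  # nearest training row for each new point
> estm[nearest]                                     # 27.5 30.0, matching the output above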

The choice of k is somewhat similar to the choice of bin width in histograms, i.e. there is no magic formula for choosing a good value of k. There is an obvious tradeoff: too small a k means that each average is computed from too few data points, while too large a k means taking in "near" neighbors that may not actually be similar to the point being predicted. In the interest of avoiding really large amounts of computation, I suggest setting k to min(100,nrow(z)).
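One simple, if crude, way to compare a few candidate values of k is to see how well the smoothed values match the observed responses; since a point is excluded from its own neighborhood, this amounts to a rough leave-one-out check. A sketch, using the toy data above:

> for (k in 1:3) print(c(k, mean(abs(smoothz(x,knnreg,k) - x[,3]))))  # k, mean absolute error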

The FNN library is used for speed of computation, but for large data sets parallel computation will be helpful. For this we will set up a Snow cluster.

The snow package, on R's CRAN repository of contributed code, was long a popular way of doing parallel computation in R. It has since been incorporated into the parallel library that ships with base R, but for convenience I will refer to that portion of the library as Snow.

A Snow cluster is not physical; it is merely an agreement to communicate among several R processes. Say for instance I am running R on the machine pc36 in the CSIF lab at UC Davis. That is a dual-core machine, so I could issue the commands

> library(parallel)
> c2 <- makeCluster(2)

and then set cls = c2 in my calls to the smoother. That would give me a potentially factor-of-2 speedup (possibly even more, for reasons I won't go into here).
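Assuming, per the remark above, that the cluster is passed to smoothz through an argument named cls (check Smoother.R for the exact signature), the call would look something like

> estm <- smoothz(x,knnreg,2,cls=c2)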

Or, I can build a cluster on several machines, e.g.

> cl <- makeCluster(c("pc36","pc37","pc38"))

Note carefully that if you create a cluster of r workers, that launches r separate instances of R! So, be sure to shut them down when you're done, e.g.

> stopCluster(cl)