The code in Smoother.R finds nonparametric estimates of regression and density functions, using the k-Nearest Neighbor method. The number of variables can be general, not just 1 or 2. Since large amounts of computation may be involved, it optionally allows the estimates to be computed in parallel.

The description will focus on the regression case. A regression function is a conditional mean of one variable (the response variable) given values of other variables (the predictor variables). We have sample data on both response and predictors, and want to predict future data points in which only the predictor variables have known values. We estimate the conditional mean to do so.

As a simple example, suppose we are predicting people's weights
(response) from their heights and ages (predictors). Say we have a
person of height 70 inches and age 32 but unknown weight. We would use our
data to estimate the mean weight of all people in the population who have
that height and age, and use that estimate as our predicted weight
for this new person. That estimate would consist of the mean weight
in our sample among the **k** people who are nearest
that height and age.

Euclidean distance (square root of sums of squares) is used in defining
nearness. The user may wish to center and scale the predictor data
first, using R's **scale()** function, so that distances
put all variables on an equal footing, each one now having mean 0 and
variance 1.
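As a quick illustration with made-up height/age data (the matrix **m** below is my own example, not data from Smoother.R):

```r
# Put the predictors on an equal footing before computing distances:
# scale() centers each column to mean 0 and rescales it to variance 1.
m <- cbind(height = c(70, 64, 72, 68, 61), age = c(32, 45, 29, 50, 38))
ms <- scale(m)
round(colMeans(ms), 10)   # each column now has mean 0
apply(ms, 2, sd)          # and standard deviation 1
```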

Here is an example. My matrix **x**, randomly
generated, is

```
> x
     [,1] [,2] [,3]
[1,]    7   37   36
[2,]   21    3   24
[3,]    2   18   42
[4,]   49   35   31
[5,]   34   28   17
```

Here the first 2 columns are predictors, and the third is the response. Let's refer to them as U, V and W.

I called the smoother (it "smooths" by averaging the response in neighborhoods):

```
> estm <- smoothz(x,knnreg,2)
> estm
[1] 29.5 29.5 30.0 26.5 27.5
```

Here I said I wanted regression estimation, using 2 nearest neighbors. Let's look at the last number, for instance. It says that our estimate for the mean population W among all entities (people, cars, whatever) for which U = 34 and V = 28 is 27.5. Let's check it:

The two closest points to data point 5 are rows 4 and 2. So, we estimate the regression function at row 5 to be (31+24)/2 = 27.5.

Note that in finding nearest neighbors of a point, the point itself is not counted. Thus a data point is not being used to predict itself, which guards somewhat against overfitting.
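To make that arithmetic concrete, here is a small sketch in plain base R (not the code in Smoother.R) that recomputes the estimate at data point 5, excluding the point itself:

```r
# The example data: columns 1-2 are the predictors, column 3 the response.
x <- matrix(c(7,21,2,49,34, 37,3,18,35,28, 36,24,42,31,17), ncol = 3)
preds <- x[, 1:2]
resp <- x[, 3]

# Squared Euclidean distances from data point 5 to every data point.
d <- colSums((t(preds) - preds[5, ])^2)
d[5] <- Inf                # exclude the point itself
nearest2 <- order(d)[1:2]  # rows 4 and 2
mean(resp[nearest2])       # (31+24)/2 = 27.5
```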

Now, let's try predicting some new data, for which we have U and V data but not W. Our new data, again made up, is

```
> newx
     [,1] [,2]
[1,]   28   18
[2,]   10   15
```

So, we have 2 new data points, one per row. Here are our predictions for their W values:

```
> smoothzpred(newx,x[,1:2],estm)
[1] 27.5 30.0
```

So, we predict the first point's W to be 27.5.
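As a check on those numbers, the sketch below assumes each new point simply receives the fitted value in **estm** belonging to its nearest training point; under that assumption (mine, not necessarily exactly what smoothzpred does internally) we reproduce the output above:

```r
# Training data and fitted values from the earlier smoothz() call.
x <- matrix(c(7,21,2,49,34, 37,3,18,35,28, 36,24,42,31,17), ncol = 3)
estm <- c(29.5, 29.5, 30.0, 26.5, 27.5)
newx <- rbind(c(28, 18), c(10, 15))

# For each new point, take the fitted value at its nearest training point.
pred1nn <- apply(newx, 1, function(p) {
  d <- colSums((t(x[, 1:2]) - p)^2)  # squared distances to training points
  estm[which.min(d)]
})
pred1nn   # 27.5 30.0
```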

The choice of **k** is somewhat similar to the choice of bin width in
histograms, i.e. there is no magic formula for choosing a good value of
k. There is an obvious tradeoff--too small a **k** means that averages are
computed on too few data points, while too large a **k** means taking "near"
neighbors that may not be similar to the point being predicted. In the
interest of avoiding really large amounts of computation, I suggest
setting **k** to **min(100,nrow(z))**.

The FNN library is used for speed of computation, but for large data sets parallel computation will be helpful. For this we will set up a Snow cluster.

On R's CRAN repository of contributed code, **snow** used
to be a popular method for doing parallel computation with R. It has
recently been incorporated into the library **parallel** in
base R, but for convenience I will call that portion of the library
Snow.

A Snow cluster is not physical; it is merely an agreement to communicate
among several R processes. Say for instance I am running R on the
machine **pc36** in the CSIF lab at UC Davis. That is a
dual-core machine, so I could emit the commands

```
> library(parallel)
> c2 <- makeCluster(2)
```

and then set **cls = c2** in my calls to the smoother.
That would give me a potentially factor-of-2 speedup (possibly even
more, for reasons I won't go into here).

Or, I can build a cluster on several machines, e.g.

```
> cl <- makeCluster(c("pc36","pc37","pc38"))
```

Note carefully that creating a cluster of r workers launches r separate instances of R! So, be sure to shut them down when you're done, e.g.

```
> stopCluster(cl)
```
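To sketch how such a cluster might be put to work, one can split the new points across the workers and predict each chunk independently. The chunking scheme below is my own illustration, not necessarily how Smoother.R divides the work:

```r
library(parallel)

# Training data, fitted values, and new points from the earlier example.
x <- matrix(c(7,21,2,49,34, 37,3,18,35,28, 36,24,42,31,17), ncol = 3)
estm <- c(29.5, 29.5, 30.0, 26.5, 27.5)
newx <- rbind(c(28, 18), c(10, 15))

cl <- makeCluster(2)
# Assign the new points to workers round-robin.
chunks <- split(seq_len(nrow(newx)), rep(1:2, length.out = nrow(newx)))
res <- parLapply(cl, chunks, function(idx, newx, trainpreds, estm) {
  sapply(idx, function(i) {
    d <- colSums((t(trainpreds) - newx[i, ])^2)
    estm[which.min(d)]   # fitted value at the nearest training point
  })
}, newx, x[, 1:2], estm)
unlist(res, use.names = FALSE)   # 27.5 30.0
stopCluster(cl)
```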