Professor Norm Matloff
Dept. of Computer Science
University of California at Davis
Davis, CA 95616
Why use anything else? As the Cantonese say, yauh peng, yauh leng--"both inexpensive and beautiful."
Its virtues:
I should warn you that one submits commands to R via text, rather than mouse clicks in a Graphical User Interface (GUI). If you can't live without GUIs, you should consider using one of the free GUIs that have been developed for R, e.g. R Commander or JGR. Note that R definitely does have graphics--tons of it. But the graphics are for the output, e.g. plots, not for the input.
Though the terms object-oriented and functional programming may pique the interest of computer scientists, they are actually quite relevant to anyone who uses R.
The term object-oriented can be explained by example, say statistical regression. When you perform a regression analysis with other statistical packages, say SAS or SPSS, you get a mountain of output. By contrast, if you call the lm() regression function in R, the function returns an object containing all the results--estimated coefficients, their standard errors, residuals, etc. You then pick and choose which parts of that object to extract, as you wish.
Computer scientists would say that R is polymorphic, which means that the same function can be applied to different types of objects, with results tailored to the different object types. Such a function is called a generic function. Consider for instance the plot() function. If you apply it to a simple list of numbers, you get a simple plot of them, but if you apply it to the output of a regression analysis, you get a set of plots of various aspects of the regression output. This is nice, since it means that you, as a user, have fewer commands to remember! For instance, you know that you can use the plot() function on just about any object produced by R.
The object orientation also allows you to combine several commands, each one using the output of the last, with the resulting combination being quite powerful and extremely flexible. (Unix users will recognize the similarity to Unix shell pipe commands.) For example, consider this (compound) command:
nrow(subset(x03,z==1))
First the subset() function would take the data frame x03, and cull out all those records for which the variable z has the value 1. The resulting new frame would be fed into nrow(), the function that counts the number of rows in a frame. The net effect would be to report a count of z = 1 in the original frame.
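As a quick illustration with made-up data (the data frame x03 and its variables here are invented for the example):

```r
# a small invented data frame standing in for x03
x03 <- data.frame(z=c(1,0,1,1,0), w=c(5,2,9,4,7))
# count the records in which z equals 1
nrow(subset(x03, z == 1))  # 3
```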
The functional programming aspect of importance to most users will be that one applies an operation to an entire list of elements, one by one, making things extremely convenient. This rather vague statement will be clarified as things unfold here, so don't worry about it now.
There are many excellent tutorials on R on the Web, some of which I list in the section titled "Help on the Web" below. My tutorial is designed to play a role complementary to those others. It is different from them in these senses:
(If you wish to learn R from the perspective of a programmer---whether professional or ``amateur''---you may prefer to read my writeup R for Programmers instead of the tutorial here.)
(If you do not already have R installed, go first to the section below titled How to Obtain/Install R.)
Start R, by typing R on the command line (Unix) or in a Windows Run window. You'll get a greeting, and then the R prompt, the >:
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.1.1 (2005-06-20), ISBN 3-900051-07-0
...
Type `q()' to quit R.

>
Now let's make a simple data set, a vector in R parlance, consisting of the numbers 1, 2 and 4, and name it x:
> x <- c(1,2,4)
The "c" stands for "concatenate," i.e. string together the numbers 1, 2 and 4 into one object, which we then assigned to x.
Since scalars are just one-element vectors, we can concatenate vectors too. For instance,
> q <- c(x,x,8)
would set q to (1,2,4,1,2,4,8).
Since "seeing is believing," go ahead and confirm that the data is really in x; to print the vector to the screen, simply type its name. (Python programmers will find this feature familiar.) For example,
> x
[1] 1 2 4
Yep, sure enough, x consists of the numbers 1, 2 and 4.
(The "[1]" helps users read voluminous output consisting of many rows. No such help is needed here, but you will see the need later in our second example.)
Vector indices begin at 1. Remember, one can always print an object in R by simply typing its name, so let's print out the third element of x:
> x[3]
[1] 4
Might as well find the mean and standard deviation:
> mean(x)
[1] 2.333333
> sd(x)
[1] 1.527525
If we had wanted to save the mean in a variable instead of just printing it to the screen, we could do, say,
> y <- mean(x)
Again, since you are learning, let's confirm that y really does contain the mean of x:
> y
[1] 2.333333
By the way, we use # to write comments, e.g.
> y  # print out y
[1] 2.333333
These of course are especially useful when writing programs (see the section titled "R Programming" below), but they are useful for interactive use too, since R does record your commands (see "Session Data" below). The comments then help you remember what you were doing when you read the record.
As the last action in this quick introduction to R, let's have R draw a histogram of the data:
> hist(x)
A window pops up with the histogram in it. This one won't be very pretty, but R has all kinds of bells and whistles you can use optionally. For instance, you can change the number of bins by specifying the breaks argument; hist(z,breaks=12) would draw a histogram of the data z with 12 bins. You can make nicer labels, etc. Also, if you prefer the more sophisticated kernel method for estimating a density, R offers the density() function, an example of which is in our section below titled "Plotting Multiple Curves, Using the lines() Function".
Well, that's the end of this first 5-minute introduction. We leave by calling the quit function (or optionally by hitting ctrl-d in Unix):
> q()
Save workspace image? [y/n/c]: n
That last question asked whether we want to save our variables, etc., so that we can resume work later on. If we answer y, then the next time we run R, all those objects will automatically be loaded. This is a very important feature; see more in our section "Session Data" below.
Here we will do some multivariate analysis, and also introduce data frames, which allow a richer structure than simply using individual vectors.
As my sample data set, I have created a file named exams, consisting of grades for the three exams in a certain course (two midterm exams and a final exam). The first few lines in the file are
Exam1 Exam2 Exam3
62 70 60
74 34 64
50 35 40
...
Note that I have separated fields here by spaces. (R can also use .csv files, from spreadsheets. See the material on .csv files below for details.)
As you can see, other than the first record, which contains the names of the columns (i.e. the variables), each line contains the three exam scores for one student. This is the classical "two-dimensional file" notion, i.e. each line in our file contains the data for one observation from our sample. The idea of a data frame is to encapsulate such data, along with variable names and even line names, into one object.
As mentioned, I've specified the variable names in the first record. Our variable names didn't have any embedded spaces in this case, but if they had, we'd need to quote any such name.
Suppose the second exam score for the third student had been missing. Then we would have typed
50 NA 40
in that line of the exams file. In any subsequent statistical analyses, R would do its best to cope with the missing data, in the obvious manners. If for instance we had wanted to find the mean score on Exam 2, R would find the mean among all students except the third.
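Be aware, though, that many R functions propagate NA by default; you must explicitly ask them to skip missing values, via the na.rm argument. A small sketch:

```r
scores <- c(70, 35, NA)   # one missing score
mean(scores)              # NA -- the missing value propagates
mean(scores, na.rm=TRUE)  # 52.5, the mean of the observed scores
```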
We first read in the data from the file exams into a data frame which we'll name testscores:
> testscores <- read.table("exams",header=TRUE)
The parameter header=TRUE tells R that we do have a header line (for the variable names), so R should not count that first line in the file as data.
By the way, note the use of named arguments here. The function read.table() has a number of arguments, some of which are optional, which means that we must specify which arguments we are using, by using their names, e.g. header=TRUE above. (Again, Python programmers will find this familiar.) The ones you don't specify all have default values. Some values can be abbreviated, e.g. T for TRUE.
In R, the components of an object are accessed via the $ operator. For example, the vector of all the Exam1 scores is testscores$Exam1, as we confirm here:
> testscores$Exam1
 [1]  62  74  50  62  39  60  48  80  49  49 100  30  61 100  82  37  54  65  36
[20]  97  60  80  70  50  60  24  60  75  77  71  25  93  80  92  75  26  27  55
[39]  30  44  86  35  95  98  50  50  34 100  57  99  67  77  70  53  38
(The [1] means that items 1-19 start here, the [20] means that items 20-38 start here, etc.)
So, you can see that the data frame testscores is a way of packaging the vectors of values of the individual variables, plus the associated variable names. For instance, instead of retrieving the means of the three variables individually, by calling mean() on each vector, we can get the whole group of means via one single command:
> colMeans(testscores)
   Exam1    Exam2    Exam3
62.14545 51.27273 50.05455
We can add columns, i.e. new variables, to the data frame. For example, in testscores, we can add a variable which is the difference between Exams 1 and 2:
> testscores$Diff21 <- testscores$Exam2 - testscores$Exam1
By the way, it can get pretty tiring to type out expressions like testscores$Exam3 all the time, so R gives us a shortcut, using the attach() function:
> attach(testscores)
This command tells R that from now on, when we refer, for example, to Exam3, we mean testscores$Exam3:
> mean(Exam3)
[1] 50.05455
If we want R to stop doing that, we use detach(), e.g.
> detach()
> mean(Exam3)
Error in mean(Exam3) : Object "Exam3" not found
Let's see how well the third exam score can be predicted from the first, using a linear regression function:
> fit1 <- lm(Exam3 ~ Exam1,data=testscores)
Here we fit the model
E(Exam3 | Exam1) = c_{0} + c_{1} Exam1
under which we assume that the mean Exam3 score for a given Exam1 score is a linear function of the latter. (The function name "lm" stands for "linear model.")
Now the object fit1 contains all the results of the regression run, i.e. coefficients, residuals, etc. We can pick out that information individually if we wish, e.g. obtain the estimated regression coefficients:
> fit1$coefficients
(Intercept)       Exam1
  3.7008841   0.7458898
Note the effect of object orientation here. The coefficients themselves are accompanied within the object by their names, and when we print the coefficients R knows to print the names with them. Yet at the same time we can work with them as numbers; R will be smart enough to leave the names out of things then.
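Here is a small sketch of that behavior, using a made-up named vector rather than actual regression output:

```r
cf <- c(3.70, 0.75)
names(cf) <- c("(Intercept)", "Exam1")
cf       # prints with the names displayed above the values
2 * cf   # ordinary arithmetic works; the names are carried along
```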
Or, if we wish, we can get a ton of information in one fell swoop, the way we do in other statistical packages:
> summary(fit1)

Call:
lm(formula = Exam3 ~ Exam1, data = testscores)

Residuals:
     Min       1Q   Median       3Q      Max
-42.9132 -11.0895   0.9389  12.0786  32.8163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.70088    6.52037   0.568    0.573
Exam1        0.74589    0.09877   7.552 5.85e-10 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 16.31 on 53 degrees of freedom
Multiple R-Squared: 0.5183,     Adjusted R-squared: 0.5092
F-statistic: 57.03 on 1 and 53 DF,  p-value: 5.85e-10
So for example, we estimate that Exam 1 scores explain about 50% of the variation in Exam 3 scores.
Let's now make a scatter plot of the original Exam1-Exam3 data, together with the fitted regression line. First, the scatter plot:
> plot(testscores$Exam1,testscores$Exam3)
A scatter plot window then pops up. (Again, it won't be very fancy for the time being, since we are using default values, but you can make it fancy with some added commands when you learn more about R.)
Now, we superimpose the fitted regression line onto that plot:
> abline(fit1)
The name of this function is meant to suggest, "Draw a line with Y intercept a and slope b." So, for instance, the call abline(2,5) will draw the line y = 2 + 5x. But if the argument to the call is a regression object, e.g. fit1 above, then abline() knows to use the regression coefficients as the intercept and slope of the line to be drawn. Again we see polymorphism at work.
(The function abline() can take various forms of arguments. The simplest form just specifies two, a Y-intercept value and a slope; in the case above, it takes a regression output object instead.)
We could get a number of plots related to this fit by typing
plot(fit1)
If we had wanted to regress Exam3 against both Exam1 and Exam2, we would issue the following command:
> fit12 <- lm(testscores$Exam3 ~ testscores$Exam1 + testscores$Exam2)
R has a very rich set of graphics facilities. The front of the R home page has some colorful examples. There are many extensive Web tutorials on this (some are listed below), and the entire book, R Graphics, by Paul Murrell (Chapman and Hall, 2005), is devoted to presenting the various graphics facilities of R. I cannot cover even a small part of it here, but we will give you enough foundation to work and learn more.
This is the workhorse function for graphing, serving as the vehicle for producing many different kinds of graphs. As mentioned earlier, it senses from the type of the object sent to it what type of graph to make.
We typically first call plot() with an X vector and a Y vector, which are interpreted as a set of (X,Y) points. For example,
> plot(c(1,2,3), c(1,2,4))
plots the points (1,1), (2,2) and (3,4).
The plot() function works in stages, i.e. you can build up a graph in stages by giving more and more commands. Recall that for instance in our section titled "A Second R Session" above, we first created a scatter plot and then later superimposed the fitted regression line onto the same graph. (You do not need to have read that section to follow the material here.) Often we will just call plot() first to create the X and Y axes of the graph, and their labels, and then add to it with subsequent calls to various functions. A common way of doing this is to use the lines() function.
Though there are many options, the two basic arguments to lines() are a vector of X values and a vector of Y values. These are interpreted as (X,Y) pairs representing points to be added to the current graph, with lines connecting the points.
For instance, if x and y are the vectors (1.5,2.5) and (3,3), then the call
> lines(c(1.5,2.5),c(3,3))
would add a line from (1.5,3) to (2.5,3) to the present graph.
As another example, let's plot nonparametric density estimates (these are basically smoothed histograms) for Exams 1 and 2 from our exams file in our section titled "A Second R Session" above in the same graph. We use the function density() to generate the estimates. Here are the commands we issue:
> d1 = density(testscores$Exam1,from=0,to=100)
> d2 = density(testscores$Exam2,from=0,to=100)
> plot(d2,main="",xlab="")
> lines(d1)
Here's what we did: First, we computed nonparametric density estimates from the two variables, saving them in objects d1 and d2 for later use. We then called plot() to draw the curve for Exam2. The internal structure of d2 contains vectors of X and Y coordinates needed by plot() to draw the figure. We then called lines() to add Exam1's curve to the graph.
Note that we asked R to have blank labels for the figure as a whole and for the X axis; otherwise, R would have gotten such labels from d2, which would have been specific to Exam 2.
If you want the lines "connecting the dots" but don't want the dots themselves, include type="l" in your call to lines():
> plot(x,y,type="l")
> lines(x,z,type="l")
The call to plot() both initiates the plot and draws the first curve. (Without specifying type="l", only the points would have been plotted.) The call to lines() then adds the second curve.
You can use the lty parameter in plot() to specify the type of line, e.g solid, dashed, etc. Type
> help(par)
to see the various types and their codes.
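For instance, lty=2 typically requests a dashed line (a sketch with arbitrary data):

```r
x <- 1:10
plot(x, x^2, type="l", lty=2)  # dashed curve
lines(x, 10*x, lty=1)          # solid line added to the same graph
```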
It may be very useful to save your graphics commands to a file. For example, we could put the above commands in a file, testscores.r, with contents
d1 = density(testscores$Exam1,from=0,to=100)
d2 = density(testscores$Exam2,from=0,to=100)
plot(d2,main="",xlab="")
lines(d1)
To execute these commands, we would type
> source("testscores.r")
The advantage of this is that if we ever wanted to rerun this sequence of commands, say with a slightly different set of parameters, we need only edit the file testscores.r to reflect the parameter changes and then run as above.
When source() is executed, the code in the specified file is run. In our case here, the file contained ordinary commands, so they were simply executed, just as if we had typed them at the R prompt. If the file had instead contained function definitions, the effect would have been for R to read those functions into its workspace.
R has the notion of a graphics device. The default device is the screen. If we want to have a graph saved to a file, we must set up another device. For example, if we wish to save as a PDF file, we do something like this:
> pdf("d12.pdf")
This opens a file, which we have chosen here to call d12.pdf. We now have two devices open, as we can confirm:
> dev.list()
X11 pdf
  2   3
Our first device, the screen, is named X11 when R runs on Unix; it is device number 2. Our PDF file is device number 3.
Our current device is now the PDF file (which we could confirm by calling dev.cur()). All graphics output will now go to this file instead of to the screen. But what if we wish to save what's already on the screen? We could re-establish the screen as the current device, then copy it to the PDF device:
> dev.set(2)
X11
  2
> dev.copy(which=3)
pdf
  3
Note carefully that the PDF file is not usable until we close it, which we do as follows:
> dev.set(3)
pdf
  3
> dev.off()
X11
  2
(We could also close the device by exiting R, though it's probably better to proactively close.)
The above set of operations to print a graph can become tedious if used a lot, so it makes sense to put them into a file.
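For instance, one might wrap the whole sequence in a small function (a sketch; the name pr2file is my own invention, and here dev.copy() is used in its form that opens a new PDF device and copies the current graph to it):

```r
# copy whatever is on the current graphics device to a PDF file
pr2file <- function(filename) {
   dev.copy(pdf, file=filename)  # open a PDF device and copy the graph to it
   dev.off()                     # close the PDF file so it becomes usable
}
```

After plotting on the screen, a call like pr2file("d12.pdf") then saves the current graph.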
The points() function adds a set of (x,y)-points, with labels for each, to the currently displayed graph. For instance, in our second 5-minute example above:
points(testscores$Exam1,testscores$Exam3,pch="+")
would superimpose onto the current graph the points of the exam scores from that example, using "+" signs to mark them.
As with most of the other graphics functions, there are lots of options, e.g. point color, background color, etc.
A nice function is legend(), which is used to add a legend to a multicurve graph. For instance,

> legend(2000,31162,legend="CS",lty=1)

would place a legend at the point (2000,31162) in the graph, with a little line of type 1 and the label "CS". Try it!
Say you wanted to plot the function g(t) = (t^{2}+1)^{0.5} for t between 0 and 5. You could use the following R code:
g <- function(t) { return((t^2+1)^0.5) }  # define g()
x <- seq(0,5,length=10000)  # x = [0, 0.0005, 0.0010, ..., 5], approximately
y <- g(x)  # y = [g(0), g(0.0005), g(0.0010), ..., g(5)]
plot(x,y,type="l")
But even better, you could use the curve() function:
> curve((x^2+1)^0.5,0,5)
Just plotting a cloud of points, whether connected or not, may turn out to be just an uninformative mess. In many cases, it is better to smooth out the data by fitting a nonparametric regression estimator, such as lowess():
plot(lowess(x,y))
The call lowess(x,y) returns the pairs of points on the regression curve, and then plot() plots them. Of course, we could get both the cloud and the smoothed curve:
plot(x,y) lines(lowess(x,y))
Use the text() function to place some text anywhere in the current graph. For example,
text(2.5,4,"abc")
would write the text "abc" at the point (2.5,4) in the graph. The center of the string, in this case "b", would go at that point.
In order to get a certain string placed exactly where you want it, you may need to engage in some trial and error. R has no "undo" command (though the ESS interface to R described below does). For that reason, you might want to put all the commands you're using to build up a graph in a file, and then use source() to execute them. See the section on Startup Files below.
But you may find the locator() function to be a much quicker way to go. See the section on it below.
Typing
locator(1)
will tell R that you will click in 1 place in the graph. Once you do so, R will tell you the exact coordinates of the point you clicked on. Call locator(2) to get the locations of 2 places, etc.
The cex ("character expansion") parameter allows you to expand or shrink characters within a graph, which can be very useful. You can use it as a named parameter in various graphing functions. For example,
text(2.5,4,"abc",cex = 1.5)
would print the same text as in our earlier example, but with characters 1.5 times normal size.
You may wish to have the ranges on the X- and Y-axes of your plot to be broader or narrower than the default. You can do this by specifying the xlim and/or ylim parameters in your call to plot() or points(). For example, ylim=c(0,90000) would specify a range on the Y-axis of 0 to 90000.
This is especially useful if you will be displaying several curves in the same graph. Note that if you do not specify xlim and/or ylim, you should draw the curve with the largest values first, so that there is room for all of them.
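A sketch with made-up data: the second curve rises higher than the first, so we reserve the full range up front with ylim:

```r
x <- seq(0, 1, length=100)
plot(x, x^2, type="l", ylim=c(0, 1))  # reserve room on the Y-axis
lines(x, sqrt(x))                     # the taller curve still fits
```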
You must warn R ahead of time that you intend a variable to be one of these types. For instance, say we wish y to be a two-component vector with values 5 and 12. If you try
> y[1] <- 5
> y[2] <- 12
the first command (and the second) will be rejected, but
> y <- vector(length=2)
> y[1] <- 5
> y[2] <- 12
works, as does
> y <- c(5,12)
(The latter is OK because the right-hand side is a vector type, so we are assigning a vector to y.)
Vector elements must be scalars, i.e. numbers or character strings.
Multidimensional vectors in R are called arrays. A two-dimensional array is also called a matrix, and is eligible for the usual matrix mathematical operations. I will concentrate on matrices here.
When applying an operation to two vectors which requires them to be the same length, the shorter one will be recycled, i.e. repeated, until it is long enough to match the longer one, e.g.
> c(1,2,4) + c(6,0,9,20,22)
[1]  7  2 13 21 24
Warning message:
longer object length is not a multiple of shorter object length in: c(1, 2, 4) + c(6, 0, 9, 20, 22)
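When the longer length is an exact multiple of the shorter one, the recycling is silent:

```r
c(1,2) + c(10,20,30,40)  # the (1,2) recycles to (1,2,1,2); result 11 22 31 42
```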
Matrix subscripts begin with 1, so for instance the upper-left corner of the matrix a is denoted a[1,1]. The internal linear storage of a matrix is in column-major order, meaning that first all of column 1 is stored, then all of column 2, etc.
One of the ways to create a matrix is via the matrix() function, e.g.
> y <- matrix(c(1,2,3,4),nrow=2,ncol=2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Here we concatenated what we intended as the first column, the numbers 1 and 2, with what we intended as the second column, 3 and 4. That was our data in linear form, and then we specified the number of rows and columns. The fact that R uses column-major order then determined where these four numbers were put.
Since we specified the matrix entries in the above example, we would not have needed to specify ncol; nrow alone would have been enough. For instance:
> y <- matrix(c(1,2,3,4),nrow=2)
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Note that when we then printed out y, R showed us its notation for rows and columns. For instance, [,2] means column 2, as can be seen in this check:
> y[,2]
[1] 3 4
As you can see, a matrix is really a vector, with extra attributes, the number of rows and columns.
Another way we could have built y would have been to specify elements individually:
> y <- matrix(nrow=2,ncol=2)
> y[1,1] = 1
> y[2,1] = 2
> y[1,2] = 3
> y[2,2] = 4
> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
We can perform various operations on matrices, e.g. matrix multiplication, matrix scalar multiplication and matrix addition:
> y %*% y
     [,1] [,2]
[1,]    7   15
[2,]   10   22
> 3*y
     [,1] [,2]
[1,]    3    9
[2,]    6   12
> y+y
     [,1] [,2]
[1,]    2    6
[2,]    4    8
Note that for matrix multiplication in the mathematical sense, the operator to use is %*%, not *. Note also that when a vector appears in a matrix product, R promotes it to either a one-row or a one-column matrix, whichever makes the dimensions conform.
A quick way to create the identity matrix of dimension d is diag(d).
Use t() for matrix transpose.
The same operations we discussed in our section above titled "Vector Slicing" apply to matrices. For instance:
> z
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    1    0
[3,]    3    0    1
[4,]    4    0    0
> z[,c(2,3)]
     [,1] [,2]
[1,]    1    1
[2,]    1    0
[3,]    0    1
[4,]    0    0
Here's another example:
> y <- matrix(c(11,21,31,12,22,32),nrow=3,ncol=2)
> y
     [,1] [,2]
[1,]   11   12
[2,]   21   22
[3,]   31   32
> y[2:3,]
     [,1] [,2]
[1,]   21   22
[2,]   31   32
> y[2:3,2]
[1] 22 32
The numbers of rows and columns in a matrix can be obtained through the nrow() and ncol() functions, e.g.
> x
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> nrow(x)
[1] 3
The function solve() will solve systems of linear equations, and even find matrix inverses. For example:
> a <- matrix(c(1,1,-1,1),nrow=2,ncol=2)
> b <- c(2,4)
> solve(a,b)
[1] 3 1
> solve(a)
     [,1] [,2]
[1,]  0.5  0.5
[2,] -0.5  0.5
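We can check the claimed inverse by multiplying; the product should be the identity matrix, up to rounding error:

```r
a <- matrix(c(1,1,-1,1), nrow=2, ncol=2)
a %*% solve(a)  # should equal diag(2), the 2x2 identity matrix
```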
The function write.table() works very much like read.table(). The latter reads an ASCII data file, and the former writes one. For example
write.table(x,"x.dat",row.names=F,col.names=F)
writes the matrix x to a file x.dat, with no column or row labels.
The rbind() and cbind() functions enable one to add rows or columns to a matrix or data frame.
Example:
> one
[1] 1 1 1 1
> z
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    1    0
[3,]    3    0    1
[4,]    4    0    0
> cbind(one,z)
     one
[1,]   1 1 1 1
[2,]   1 2 1 0
[3,]   1 3 0 1
[4,]   1 4 0 0
One can refer to the rows and columns of a data frame using two-dimensional array notation. For instance, in our earlier example data frame testscores from our second 5-minute example above:
We saw in our 10-minute introduction above how to create a data frame by reading from a data file. We can also create a data frame directly, using the function data.frame().
For example,
> z <- data.frame(cbind(c(1,2),c(3,4)))
> z
  X1 X2
1  1  3
2  2  4
Note the use of the cbind() function, described in our section titled "Adding More Rows or Columns" above.
We can also coerce a matrix to a data frame, e.g.
> x <- matrix(c(1,2,3,4),nrow=2,ncol=2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> y <- data.frame(x)
> y
  X1 X2
1  1  3
2  2  4
As you can see, the column names will be X1, X2, ... However, you can change them, e.g.
> z
  X1 X2
1  1  3
2  2  4
> names(z) <- c("col 1","col 2")
> z
  col 1 col 2
1     1     3
2     2     4
Consider the data matrix
1 1 1 2 2 2 3 1 2 2
where in the usual statistical fashion each row represents one subject under study. In this case, say we have asked five people (a) "Do you plan to vote for Candidate X?" and (b) "Did you vote in the last election?" Say the answers for (a), which are Yes, No and Not Sure, are coded 1, 2 and 3, respectively, while for (b) the codes are 1 for Yes and 2 for No. So, for instance, the above data would show that the third person in our sample is not planning to vote for X, and did not vote in the last election.
We can use the table() function to convert this data to contingency table format:
> ct <- matrix(c(1,1,2,3,2,1,2,2,1,2),nrow=5)
> ct
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    2    2
[4,]    3    1
[5,]    2    2
> table(ct[,1],ct[,2])

    1 2
  1 1 1
  2 0 2
  3 1 0
The table shows that we had, for example, one person who said Not Sure to (a) and Yes to (b), and two people who answered No to both questions. Of course, we can add labels to the table and do various statistical analyses on it, not covered here.
The key point is that R functions operate on vectors, in an elementwise fashion. For instance, let's apply the function for rounding to the nearest integer to an example vector y:
> y <- c(1.2,3.9,0.4)
> z <- round(y)
> z
[1] 1 4 0
The point is that the round() function was applied individually to each element in the vector y. In fact, in
> round(1.2)
[1] 1
the operation still works, because the number 1.2 is actually considered to be a vector that happens to consist of a single element, 1.2.
Here we used the built-in function round(), but you can do the same thing with functions that you write yourself. See the section titled "R Programming" below.
Even operators like + are really functions. For example, the reason why elementwise addition of 4 works here,
> y+4
[1] 5.2 7.9 4.4
is that the + is actually considered a function! Look at it here:
> '+'(y,4)
[1] 5.2 7.9 4.4
You can also do "slicing" of arrays, picking out elements with specific indices, e.g.
> y[c(1,3)]
[1] 1.2 0.4
> y[2:3]
[1] 3.9 0.4
In that second command, the notation m:n means all the integers between m and n inclusive, e.g.
> 3:8
[1] 3 4 5 6 7 8
Note carefully that duplicates are definitely allowed, e.g.
> x <- c(4,2,17,5)
> y <- x[c(1,1,3)]
> y
[1]  4  4 17
Negative subscripts mean that we want to exclude the given elements in our output:
> z <- c(5,12,13)
> z[-1]
[1] 12 13
> z[-1:-2]
[1] 13
In such contexts, it is often useful to use the length() function, which gives the length of the vector:
> z <- c(5,12,13)
> z[1:length(z)-1]
[1]  5 12
Here is a more involved example of this principle. Suppose we have a sequence of numbers for which we want to find successive differences, i.e. the difference between each number and its predecessor. Here's how we could do it:
> x <- c(12,15,8,11,24)
> y <- x[-1] - x[-length(x)]
> y
[1]  3 -7  3 13
Here we want to find the numbers 15-12 = 3, 8-15 = -7, etc. The expression x[-1] gave us the vector (15,8,11,24) and x[-length(x)] gave us (12,15,8,11). Subtracting these two vectors then gave us the differences we wanted.
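By the way, R also has a built-in function, diff(), that computes exactly these successive differences:

```r
x <- c(12,15,8,11,24)
diff(x)  # 3 -7 3 13, the same differences as above
```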
We can apply slicing to a matrix instead of just a vector. For example:
> g <- matrix(c(1,2,3,4,5,6),nrow=3,ncol=2)
> g
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> t <- g[c(1,3),]
> t
     [,1] [,2]
[1,]    1    4
[2,]    3    6
Another idea borrowed from functional programming is filtering of vectors, e.g.
> z <- c(5,2,-3,8)
> w <- z[z*z > 8]
> w
[1]  5 -3  8
Here is what happened above: We asked R to find all the elements of z whose squares were greater than 8, and then form a new vector from them (which we assigned to w).
In fact, what happened above actually occurred at a more fundamental level. Look at this:
> z <- c(5,2,-3,8)
> z
[1]  5  2 -3  8
> z*z > 8
[1]  TRUE FALSE  TRUE  TRUE
Evaluation of the expression z*z > 8 gave us a vector of booleans! Let's go further:
> z[c(TRUE,FALSE,TRUE,TRUE)]
[1]  5 -3  8
So you can see that the evaluation of z[z*z > 8] first produced a vector of booleans, which we then applied in a slicing operation on z.
This example will place things into even sharper focus:
> z <- c(5,2,-3,8)
> j <- z*z > 8
> j
[1]  TRUE FALSE  TRUE  TRUE
> y <- c(1,2,30,5)
> y[j]
[1]  1 30  5
We may just want to find the positions within z at which the condition occurs. We can do this using which():
> which(z*z > 8)
[1] 1 3 4
Also, in our student exam scores example above, we could use the expression
> goodexam2 <- testscores[testscores$Exam2 >= 62,]
Yes, we need the testscores$ prefix there on Exam2. We can get around this if we use attach().
Be careful to get the syntax right. Here is another example:
> w <- matrix(1:9,nrow=3,ncol=3)
> w
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> w[w[,2] > 4,]
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    3    6    9
Do you see what happened in that last action? The expression w[,2] > 4 gave us the vector c(F,T,T). Thus the overall expression was w[c(F,T,T),], meaning to choose the second and third rows of w. Without that final comma, indicating rows, R would have given us an error message.
One can also use the subset() function for these kinds of tasks:
> goodexam2 <- subset(testscores,Exam2 >= 62)
Note that we do not need to qualify the name Exam2 when using subset().
I noted in our introductory section, Why You Should Use R, that using the nrow() function in conjunction with filtering provides a way to obtain a count of records satisfying various conditions. If you just want the count and don't want to create a new table, you should use this approach.
You can also use this to selectively change elements of a vector, e.g.
> x <- c(1,3,8,2)
> x[x > 3] <- 0
> x
[1] 1 3 0 2
The form is
ifelse(some_vector_filtering_expression,b,c)
where some_vector_filtering_expression is of the form in section titled "Filtering" above, and b and c are constants.
For example, say we have a 3x2 matrix z, whose second column we wish to fill as follows: For each element in the first column, if it is at least 2, set the corresponding element in the second column to 1; otherwise set that element to 0. Here's how we could do it:
> z <- matrix(nrow=3,ncol=2)
> z[,1] <- c(1,2,3)
> z
     [,1] [,2]
[1,]    1   NA
[2,]    2   NA
[3,]    3   NA
> z[,2] <- ifelse(z[,1] >= 2,1,0)
> z
     [,1] [,2]
[1,]    1    0
[2,]    2    1
[3,]    3    1
This is not just a matter of compactness of code, but also of speed. If speed is an issue, such as when working with large data sets or long-running simulations, you must avoid explicit loops as much as possible, because R's internal loops run much faster than explicit loops written in R code.
Use the apply() function. In its simpler form, its arguments are the matrix to which the function is to be applied, the dimension--1 for rows, 2 for columns--and the function itself.
For example, here we apply the built-in R function mean() to each column of a matrix z.
> z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> apply(z,2,mean)
[1] 2 5
Here is an example of working on rows, using our own function:
> f <- function(x) x/c(2,8)
> y <- apply(z,1,f)
> y
     [,1]  [,2] [,3]
[1,]  0.5 1.000 1.50
[2,]  0.5 0.625 0.75
Note that if the function to be applied returns a vector of k components, the result of apply() will have k rows. You can use the matrix transpose function t() to change it.
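For instance, continuing the example above, we might transpose the result to get one row per row of z:

```r
z <- matrix(1:6, nrow=3, ncol=2)   # the 3x2 matrix from the example above
f <- function(x) x/c(2,8)
y <- apply(z, 1, f)   # f returns 2 components, so y has 2 rows
t(y)                  # transpose back: one row per row of z
```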
As you can see, the function to be applied needs at least one argument, which will play the role of one row or column in the array. In some cases, you will need additional arguments, which you can place following the function name in your call to apply().
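For example (the countabove() function and its cutoff argument are just made up for illustration):

```r
# extra arguments placed after the function name in apply() are passed on;
# here cutoff is our own invented extra argument
z <- matrix(1:6, nrow=3, ncol=2)
countabove <- function(row, cutoff) sum(row > cutoff)
apply(z, 1, countabove, 3)   # how many elements of each row exceed 3
```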
The seq() ("sequence") function generates an arithmetic sequence, e.g.:
> seq(5,8)
[1] 5 6 7 8
> seq(12,30,3)
[1] 12 15 18 21 24 27 30
> seq(1.1,2,length=10)
 [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
Though it may seem innocuous, the seq() function provides a foundation for many R operations. See the examples in the sections below titled "Simulation Programming" and "Graphing Explicit Functions".
Note also the : operator:
> 5:8
[1] 5 6 7 8
> 5:1
[1] 5 4 3 2 1
Beware of operator precedence: 1:i-1 parses as (1:i)-1, not 1:(i-1):

> i <- 2
> 1:i-1
[1] 0 1
The rep() ("repeat") function allows us to conveniently put the same constant into long vectors. The call form is rep(z,k), which creates a vector of k elements, each equal to z. E.g.,
> x <- rep(8,4)
> x
[1] 8 8 8 8
Yet another influence of the functional programming philosophy is that functions do not change their arguments (unless the result is re-assigned to the argument). Consider, for instance, this:
> x <- c(4,1,3)
> y <- sort(x)
> y
[1] 1 3 4
> x
[1] 4 1 3
The point is that x didn't change.
Functions can be used as arguments, assigned, etc. E.g.:
> f1 <- function(a,b) return(a+b)
> f2 <- function(a,b) return(a-b)
> f <- f1
> f(3,2)
[1] 5
> f <- f2
> f(3,2)
[1] 1
> g <- function(h,a,b) h(a,b)
> g(f1,3,2)
[1] 5
> g(f2,3,2)
[1] 1
You can view the code for a function (either one you wrote, or one in R), e.g.
> f1
function(a,b) return(a+b)
This is handy if you're using a function that you've written but have forgotten what its arguments are, for instance. It's also useful if you are not quite sure what an R library function does; by looking at the code you may understand it better.
Also, a nice feature is that you can edit functions from within R. In the case above, I could change f1() by typing
> f1 <- edit(f1)
This would open the default editor (which is changeable) to the code for f1(), which I could then edit and save. Note that in this example I am saving the revision back to the same function.
If your data are merely vectors, instead of columns in a data frame, then do not specify the data argument in lm(). For instance:
> x1 <- c(1,2,3,4)
> x2 <- c(1,5,5,6)
> y <- c(2,7,8,9)
> lm(y ~ x1+x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
     0.4286       0.4857       1.1429
Instead of lm(), we can use the "old" linear model function, lsfit() ("least squares fit"). This function is sometimes more convenient to use when writing R programs. In our example above, we would call it this way:
> fit1 <- lsfit(testscores$Exam1,testscores$Exam3)
We might follow up with
> ls.print(fit1)
To regress Exam3 against Exam1 and Exam2, we would type:
> fit12 <- lsfit(cbind(testscores$Exam1,testscores$Exam2),testscores$Exam3)
The first argument to lsfit() is a matrix of the predictor variable data, with the ith row consisting of observation number i in our sample. So, here the call to cbind() combines the two vectors into one matrix, whose first column is the first vector and whose second column is the second vector. See our section titled Adding More Rows and Columns for more information on this function.
The "A ~ B+C+..." notation for specifying a response variable A and predictor variables B, C, ... is usually nice, but sometimes troublesome if you have many predictors or you are writing R programs. But alternatively you can specify the predictors as a matrix, e.g.
> z <- matrix(c(1,2,3,8,1,2,3,4,0,3,9,1),nrow=4,ncol=3)
> lm(z[,1] ~ z[,2]+z[,3])

Call:
lm(formula = z[, 1] ~ z[, 2] + z[, 3])

Coefficients:
(Intercept)      z[, 2]      z[, 3]
    -1.6779      2.4899     -0.3221

> lm(z[,1] ~ z[,c(2,3)])

Call:
lm(formula = z[, 1] ~ z[, c(2, 3)])

Coefficients:
  (Intercept)  z[, c(2, 3)]1  z[, c(2, 3)]2
      -1.6779         2.4899        -0.3221
In that second call to lm(), we just specified the matrix consisting of the second and third columns of z, implicitly saying that these columns are our predictors.
Logistic regression is handled as a special case of the generalized linear model. Say our data is in a matrix x, whose first column consists of 0s and 1s and forms the response variable. Say the second column is the sole predictor variable. Then we would run
> lgt <- glm(x[,1] ~ x[,2], family=binomial)
As with linear regression, one ironically gets more information by requesting the summary (!), e.g.
> summary(lgt)
As you proceed through an R session, R will record which commands you submit. And as long as you answer yes to the question "Save workspace image?" put to you when you quit the session, R will save all the objects you created in that session, and restore them in your next session. You thus do not have to recreate the objects from scratch if you wish to continue work from before.
The ls() command will list all of your current objects.
To remove objects you no longer need, use rm(). For instance,
> rm(a,b,x,y,z,uuu)
would remove the objects a, b, x, y, z and uuu.
One of the named arguments of rm() is list, which makes it easier to remove multiple objects. For example,
> rm(list = ls())
would pass the names of all of your objects via the list argument, thus removing everything.
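A sketch of a more selective variant, removing only objects whose names match a pattern (the tmp names here are hypothetical):

```r
# remove only the objects whose names begin with "tmp"
tmp1 <- 1; tmp2 <- 2; keepme <- 3         # hypothetical objects
rm(list = ls(pattern = "^tmp"))           # ls() can filter names by pattern
ls()                                      # keepme survives
```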
An object consists of a gathering of various kinds of information, with each kind being called an attribute. The names() function will tell us the names of the attributes of the given object. For a data frame, for example, these will be the names of the columns. For a regression object, these will be "coefficients", "residuals" and so on. Calling the attributes() function will give you all this, plus the class of the object itself.
Recall that when one does almost any complex operation in R, the results are returned packaged as an object. Say for instance we perform a linear regression analysis by calling lm(), assigning the result to lmout. Then the coefficients, residuals etc. are all accessible as components of this object lmout, in lmout$coefficients, lmout$residuals etc.
Suppose we then want to use one of those components as numerical input into another operation. Say for instance we wish to use the coefficients to predict the Y values at a few points of particular interest, say W[1],...,W[25]. Then we may not be able to simply take their inner products with the coefficient vector, e.g.
> W[1] %*% lmout$coefficients
may be rejected by R. The problem is that lmout$coefficients consists not only of the coefficients themselves, but also the names of the corresponding predictor variables.
To remove the latter, we use the as.numeric() function:
> W[1] %*% as.numeric(lmout$coefficients)
And if W[1] itself had names attached to it, we'd need to use as.numeric() on it too.
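A minimal sketch, with a made-up named coefficient vector standing in for lmout$coefficients:

```r
# coefs is a made-up stand-in for the named vector lm() returns
coefs <- c("(Intercept)"=0.43, x1=0.49, x2=1.14)
w1 <- c(1, 2, 3)              # a hypothetical point, leading 1 for the intercept
w1 %*% as.numeric(coefs)      # plain numeric inner product, names stripped
```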
As mentioned in my introduction, R is rather polymorphic, in the sense that the same function has different operation for different classes. One can apply plot(), for example, to many types of objects, getting an appropriate plot for each. The same is true for print() and summary().
In this manner, we get a uniform interface to different classes. So, when someone develops a new R class for others to use, we can try to apply, say, summary() and reasonably expect it to work. This of course means that the person who wrote the class, knowing the R idiom, would have had the foresight of writing such a function in the class, knowing that people would expect one.
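Here is a sketch of that idiom, defining a summary() method for a hypothetical class "exam" (the class name and data are invented for illustration):

```r
# sketch of R's S3 method idiom
scores <- c(70, 62, 98, 44)
class(scores) <- "exam"                  # tag the object with our class
summary.exam <- function(object, ...)    # summary() now dispatches here
   mean(unclass(object))                 # report the mean score
summary(scores)
```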
R has functions available for various aspects of most of the famous statistical distributions. Prefix the name by d for the density, p for the c.d.f., q for quantiles and r for simulation. The suffix of the name indicates the distribution, such as norm, unif, chisq, binom, exp, etc.
E.g. for the chi-square distribution:
> mean(rchisq(1000,df=2))  # find mean of 1000 chi-square(2) variates
[1] 1.938179
> qchisq(0.95,1)  # find 95th percentile of chi-square(1)
[1] 3.841459
There are also functions for the normal distribution, the t, the binomial. For instance, dnorm() gives the normal density, pnorm() gives the normal cdf and rnorm() generates normally-distributed random variates.
See the online help pages for details, e.g. by typing
> help(pnorm)
The usual exp(), log(), log10(), sqrt() etc. are available, as well as min(), max(), sum(), sort(), round(), floor() etc.
Also, some special math functions are described when you invoke help() with the argument Arithmetic.
R is a full programming language, similar to scripting languages such as Perl and Python. One can define functions, use constructs such as loops and conditionals, etc.
In the following example, we define a function oddcount() while in R's interactive mode, and then call the function on a couple of test cases. The function is supposed to count the number of odd numbers in its argument vector.
> # comment: counts the number of odd integers in x
> oddcount <- function(x) {
+    k <- 0
+    for (n in x) {
+       if (n %% 2 == 1) k <- k+1
+    }
+    return(k)
+ }
> oddcount(c(1,3,5))
[1] 3
> oddcount(c(1,2,3,7,9))
[1] 4
Here is what happened: We first told R that we would define a function oddcount() of one argument x. The left brace demarcates the start of the body of the function. We wrote one R statement per line. Since we were still in the body of the function, R reminded us of that by using + as its prompt instead of the usual >. After we finally entered a right brace to end the function body, R resumed the > prompt.
The body of a for, if or similar statement does not need braces if it consists of a single statement.
The line
+ for (n in x) {
will again instantly be recognized by Python programmers. It of course means that there will be one iteration of the loop for each component of the vector x, with n taking on the values of those components. In other words, in the first iteration, n = x[1], in the second iteration n = x[2], etc.
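As noted above, one-statement bodies need no braces; a compact sketch:

```r
# the for and if bodies are single statements, so no braces are needed
odd <- 0
for (n in c(1,2,5)) if (n %% 2 == 1) odd <- odd + 1
odd   # 2 odd numbers found
```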
Looping with while and repeat is also available, complete with break, e.g.
> i <- 1
> while(1) {
+    i <- i+4
+    if (i > 10) break
+ }
> i
[1] 13
(Of course, break can be used with for too.)
By the way, you don't need the return(). The last value computed will be returned by default. In the example above, instead of writing
return(k)
we could simply write
k
This is true for nonscalars too (recall that a scalar is really a one-element vector anyway), e.g.:
> r <- function(x,y) {
+    c(x+y,x-y)
+ }
> r(3,2)
[1] 5 1
The syntax for if-else is like this:
> if (r == 4) {
+    x <- 1
+    y <- 2
+ } else {
+    x <- 3
+    y <- 4
+ }
See also the ifelse() function discussed in the section, "More on the Functional Programming Nature of R" above.
Local vs. global variables within a function: If a variable z appearing within a function has the same name as a global, it will be treated as local, except that its initial value will be that of the global. Subsequent assignment to it within the function will not change the value of the global. For example:
> u
[1] 1
> v
[1] 2
> f
function(x) {
   y <- u
   y <- y + 3
   u <- x
   return(x+y+u+v)
}
> f(5)
[1] 16
> u
[1] 1
> v
[1] 2
> y
Error: object "y" not found
x + y      addition
x - y      subtraction
x * y      multiplication
x / y      division
x ^ y      exponentiation
x %% y     mod arithmetic
x %/% y    integer division
x == y     test for equality
x <= y     test for less-than-or-equal
x >= y     test for greater-than-or-equal
x & y      boolean and
x | y      boolean or
!x         boolean negation
x %in% y   x is a member of the set y
The boolean values are TRUE and FALSE. They can be abbreviated to T and F, but must be capitalized. These values change to 1 and 0 in arithmetic expressions, e.g.
> 1 < 2
[1] TRUE
> (1 < 2) * (3 < 4)
[1] 1
> (1 < 2) * (3 < 4) * (5 < 1)
[1] 0
> (1 < 2) == TRUE
[1] TRUE
> (1 < 2) == 1
[1] TRUE
> 3 %in% 1:5
[1] TRUE
There are set operations, e.g.
> x <- c(1,2,5)
> y <- c(5,1,8,9)
> union(x,y)
[1] 1 2 5 8 9
> intersect(x,y)
[1] 1 5
> setdiff(x,y)
[1] 2
> setdiff(y,x)
[1] 8 9
> setequal(x,y)
[1] FALSE
> setequal(x,c(1,2,5))
[1] TRUE
> 2 %in% x
[1] TRUE
> 2 %in% y
[1] FALSE
You can invent your own operators! Just write a function whose name begins and ends with %. Here is an operator for the symmetric difference between two sets (i.e. all the elements in exactly one of the two operand sets):
> "%sdf%" <- function(a,b) {
+    sdfab <- setdiff(a,b)
+    sdfba <- setdiff(b,a)
+    return(union(sdfab,sdfba))
+ }
> x %sdf% y
[1] 2 8 9
Try to avoid writing large loops; instead, let R's rich functionality do the work for you. Not only does this save you programming time, it produces faster code, since R's functions have been written for efficiency.
For instance, let's rewrite the function oddcount() that we wrote above:
> oddcount <- function(x) {
+    return(length(x[x %% 2 == 1]))
+ }
Let's test it:
> rr <- c(4,7,8,12,16,9)
> oddcount(rr)
[1] 2
There is no explicit loop in this version of our oddcount(). We used R's vector filtering to avoid a loop, and even though R internally will loop through the array, it will do so much faster than we would with an explicit loop in our R code.
Of course, R's functional programming features, described in our section with that title, provide many ways to help us avoid explicit loops.
Here is an example, finding P(Z < 1) for a N(0,1) random variable Z:
> count <- 0
> for (i in seq(1,1000))
+    if (rnorm(1) < 1.0) count <- count + 1
> count/1000
[1] 0.832
But as noted in the section titled "Writing Efficient R Code", you should try to use R's built-in features for greater speed. The above code would be better written
> x <- rnorm(100000)
> length(x[x < 1.0])/100000.0
We achieve an increase in speed at the expense of using more memory, by keeping our random numbers in an array instead of generating and discarding them one at a time. Suppose for example we wish to simulate sampling from an adult human population in which height is normally distributed with mean 69 and standard deviation 2.5 for men, with corresponding values 64 and 2 for women. We'll create a matrix for the data, with column 1 showing gender (1 for male, 0 for female) and column 2 showing height. The straightforward way to do this would be something like
sim1 <- function(n) {
   xm <- matrix(nrow=n,ncol=2)
   for (i in 1:n) {
      d <- rnorm(1)
      if (runif(1) < 0.5) {
         xm[i,1] <- 1
         xm[i,2] <- 2.5*d + 69
      } else {
         xm[i,1] <- 0
         xm[i,2] <- 2*d + 64
      }
   }
   return(xm)
}
We could avoid a loop this way:
sim2 <- function(n) {
   d <- matrix(nrow=n,ncol=2)
   d[,1] <- runif(n)
   d[,2] <- rnorm(n)
   smpl <- function(rw) {  # rw = one row of d
      if (rw[1] < 0.5) {
         y <- 1
         x <- 2.5*rw[2] + 69
      } else {
         y <- 0
         x <- 2*rw[2] + 64
      }
      return(c(y,x))
   }
   z <- apply(d,1,smpl)
   return(t(z))
}
Here is a quick illustration of the fact that we do gain in performance:
> system.time(sim1(1000))
[1] 0.028 0.000 0.027 0.000 0.000
> system.time(sim2(1000))
[1] 0.016 0.000 0.018 0.000 0.000
Here is a slightly more complicated example, using a classical problem from elementary probability courses. Urn 1 contains 10 blue marbles and eight yellow ones. In Urn 2 the mixture is six blue and six yellow. We draw a marble at random from Urn 1 and transfer it to Urn 2, and then draw a marble at random from Urn 2. What is the probability that that second marble is blue? This quantity is easy to find analytically, but we'll use simulation. Here is the straightforward way:
sim3 <- function(nreps) {
   nb1 <- 10  # 10 blue marbles in Urn 1
   n1 <- 18   # number of marbles in Urn 1 at 1st pick
   n2 <- 13   # number of marbles in Urn 2 at 2nd pick
   count <- 0
   for (i in 1:nreps) {
      nb2 <- 6  # 6 blue marbles orig. in Urn 2
      # pick from Urn 1 and put in Urn 2
      if (runif(1) < nb1/n1) nb2 <- nb2 + 1
      # pick from Urn 2
      if (runif(1) < nb2/n2) count <- count + 1
   }
   return(count/nreps)  # est. P(pick blue from Urn 2)
}
But here is how we can do it without loops:
sim4 <- function(nreps) {
   nb1 <- 10  # 10 blue marbles in Urn 1
   nb2 <- 6   # 6 blue marbles orig. in Urn 2
   n1 <- 18   # number of marbles in Urn 1 at 1st pick
   n2 <- 13   # number of marbles in Urn 2 at 2nd pick
   u <- matrix(runif(2*nreps),nrow=nreps,ncol=2)
   simfun <- function(rw,nb1,n1,nb2,n2) {
      if (rw[1] < nb1/n1) nb2 <- nb2 + 1
      if (rw[2] < nb2/n2) b <- 1 else b <- 0
      return(b)
   }
   z <- apply(u,1,simfun,nb1,n1,nb2,n2)
   return(mean(z))  # est. P(pick blue from Urn 2)
}
Here we have set up a matrix u with two columns of U(0,1) random variates. The first column is used for our simulation of drawing from Urn 1, and the second for the drawing from Urn 2. Our function simfun() works on one repetition of the experiment. We have set up the call to apply() to go through all of the nreps repetitions.
Actually, on my machine, this second approach was slower. So, one must not assume that using apply() will necessarily speed things up. Note, though, that in a parallel version of R (versions are being developed), apply() would probably be made automatically parallel, and we'd likely get quite a speedup.
If we define a function which we wish to keep and use in future R sessions (aside from it being saved in our workspace when we log out), we can put it in a text file and load it using source(). For example, we could put our function oddcount() above into a file, say named oc.r:
# counts the number of odd numbers in x
oddcount <- function(x) {
   k <- 0
   for (n in x) {
      if (n %% 2 == 1) k <- k+1
   }
   return(k)
}
Then whenever we wanted to use it from an R session, we could load it by typing
> source("oc.r")
Here's another example, to automate the saving of a graph which is displayed on the screen:
# prints the currently displayed graph, on device number dvnum, to the
# file filename; dvnum will typically be 2; filename must be the name of
# a PDF file, quoted; it closes the PDF file and restores dvnum as the
# current device
prpdf <- function(dvnum, filename) {
   pdf(filename)
   dvc <- dev.cur()
   dev.set(dvnum)
   dev.copy(which=dvc)
   dev.set(dvc)
   dev.off()
   dev.set(dvnum)
}
Use R's list structure as you would a C struct or a class in an OOP language--to package information.
One common use is to package return values for functions that return more than one piece of information. The $ symbol is used to designate components of an object. We've seen this above, for instance, in the output of the lm() regression function:
> fit1 <- lm(Exam3 ~ Exam1,data=testscores)
> fit1$coefficients
(Intercept)       Exam1
  3.7008841   0.7458898
We can do the same thing in functions we write ourselves. Say for instance the function f() returns a matrix m and a vector v. Then one could write
return(list(mat=m, vec=v))
at the end of the function, and then have the caller access these items like this:
l <- f()
m <- l$mat
v <- l$vec
In interactive mode, one can print the value of a variable or expression by simply typing the variable name or expression. In batch mode, one can use the print() function, e.g.
print(x)
The argument may be an object.
It's a little better to use cat() instead of print(), as the latter's output is numbered. E.g.
> print("abc")
[1] "abc"
> cat("abc\n")
abc
R includes a number of debugging facilities. They are nowhere near what a good debugging tool offers, but with skillful usage they can be effective.
A more functional debugging package is available for R, called (of course) debug. I will discuss it below.
One of the tools R offers for debugging your R code is debug(). It works in a manner similar to C debuggers such as GDB. Say for example we suspect that our bug is in the function f(). We include code (outside of that function)
debug(f)
Then when f() is executed, you enter the debugger, termed the browser in R. The command prompt will now be something like Browse[1] instead of just >. Then you can invoke various debugging operations, such as:

n -- execute the next line
c -- continue execution to the end of the function (or to the next breakpoint)
where -- print the current call stack
Q -- quit the browser
To turn off debugging for that function, type
> undebug(f)
Note that if you simultaneously have a separate window open in which you are editing your source code, each time you reload it using source(), the debugging status is lost.
If single-stepping is too tedious, an alternative is to place breakpoints, by simply inserting the call
browser()
at the points at which you wish execution to pause.
When browser() is called, you do indeed enter the browser (if you had not already entered it via debug()), and can proceed as described above. Note, though, that if you want to single-step after hitting a breakpoint, you must type n for the first step, though you can just hit the Enter key for subsequent steps. Otherwise, hitting Enter immediately after hitting a breakpoint is equivalent to hitting c.
Note that as described here, the breakpoints are hard-coded into the function's source code. To set them externally, see the trace() function described below.
The trace() function is quite flexible and powerful, though it takes some initial effort to learn. I will discuss some of the simpler usage forms here.
The call
> trace(f,t)
would instruct R to call the function t() every time we enter the function f(). For instance, say we wish to set a breakpoint at the beginning of the function gy(). We could do this by the command
> trace(gy,browser)
This would have the same effect as placing the command browser() in our source code for gy(), but would be quicker and more convenient than inserting such a line, saving the file and rerunning source() to load in the new version of the file.
It would also be quicker and more convenient to undo, by simply running
> untrace(gy)
You can turn tracing on or off globally by calling tracingState(), with the argument TRUE to turn it on, FALSE to turn it off. Recall too that these boolean constants in R can be abbreviated T and F.
Say your R code crashes when you are not running the debugger. There is still a debugging tool available to you after the fact: You can do a "post mortem" by simply calling traceback().
You can get a lot more if you set R up to dump frames on a crash:
> options(error=dump.frames)
If you've done this, then after a crash run
> debugger()
You will then be presented with a choice of levels of function calls to look at. For each one that you choose, you can take a look at the values of the variables there.
The debug package provides a more usable debugging interface than R's built-in facilities do. It features a pop-up window in which you can watch your progress as you step through your source code, gives you the ability to easily set breakpoints, etc.
It requires another package, mvbutils, and the tcl/tk scripting and graphics system. The latter is commonly included in Linux distributions, and is freely downloadable for all the major platforms. It suffers from a less-than-perfect display, but is definitely worthwhile.
Choose an installation directory, say /MyR. Then install mvbutils, then debug:
> install.packages("mvbutils","/MyR")
> install.packages("debug","/MyR")
With R version 2.5.0, I found that a bug in R caused the debug package to fail. I then installed the patched version of 2.5.0, and debug worked fine. On one machine, I encountered a Tcl/Tk problem when I tried to load debug. I fixed that (I was on a Linux system) by setting the TCL_LIBRARY environment variable, in my case by typing
% setenv TCL_LIBRARY /usr/share/tcl8.4
Each time you wish to use debug, load it by executing
> .libPaths("/MyR")
> library(debug)
Or, place these in an R startup file, say .Rprofile in the directory in which you want these commands to run automatically.
Now you are ready to debug. Here are the main points:
> mtrace(f)
Do this for each function at which you want a breakpoint.
D(1)> bp(12)
D(1)> ?bp
If you're doing anything with random numbers, you'll need to be able to reproduce the same stream of numbers each time you run your program during the debugging session. To do this, type
> set.seed(8888) # or your favorite number as an argument
If there are R commands you would like to have executed at the beginning of every R session, you can place them in a file .Rprofile either in your home directory or in the directory from which you are running R. The latter is searched for such a file first, which allows you to customize for a particular project.
Other information on startup files, e.g. the file .RData which records your workspace if you request it to be saved when you exit R, is available by typing
help(.Rprofile)
The help is piped through the pager.
You can put any set of R commands in a file, say a.r, and then have them automatically executed by typing
source("a.r")
R uses packages to store groups of related pieces of software. (See Note LIBRARY below.) The libraries are visible as subdirectories of your library directory in your R installation tree, e.g. /usr/lib/R/library. The ones automatically loaded when you start R include the base subdirectory, but in order to save memory and time, R does not automatically load all the packages. You can check which packages are currently loaded by typing
> .path.package()
If you need a package which is in your R installation but not loaded into memory yet, you must request it. For instance, suppose you wish to generate multivariate normal random vectors. The function mvrnorm() in the package MASS does this. (See "For Further Information" below for an explanation of how you might find that this is the function we want.) So, load the library:
> library(MASS)
Then mvrnorm() will now be ready to use. (As will be its documentation. Before you loaded MASS, "help(mvrnorm)" would have given an error message).
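For instance, a quick sketch (the mean vector and covariance matrix here are arbitrary; MASS must be installed, as it is in standard R distributions):

```r
library(MASS)
sigma <- matrix(c(1,0.5,0.5,1), nrow=2)   # a 2x2 covariance matrix
x <- mvrnorm(5, mu=c(0,0), Sigma=sigma)   # 5 bivariate normal vectors
dim(x)                                    # one random vector per row
```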
However, the package you want may not be in your R installation. One of the big advantages of open-source software is that people love to share. Thus people all over the world have written their own special-purpose R packages, placing them in the CRAN repository. That is accessible from the R home page, but it's easier to use the install.packages() function.
As an example, suppose you wish to use the mvtnorm package, which computes multivariate normal cdf's and other quantities. Choose a directory in which you wish to install the package (and maybe others in the future), say /a/b/c. Then at the R prompt, type
> install.packages("mvtnorm","/a/b/c/")
This will cause R to automatically go to CRAN, download the package, compile it, and install it into a new directory /a/b/c/mvtnorm.
You do have to tell R where to find that package, though, which you can do via the .libPaths() function:
> .libPaths("/a/b/c/")
This will add that new directory to the ones R was already using. A call to .libPaths() again, without an argument, will show you a list of all the places R now will look at for loading a package when requested.
R includes a few real data sets, for use in teaching or in testing software. Type the following:
> library(utils)
> data()
> help(data)
Here the data sets are contained in the utils package. We load that package, call data() to list the available data sets, and can consult help(data) for details. We can load any of the data sets by typing its name, e.g.
> LakeHuron
R has a number of string manipulation utilities, such as paste(), sprintf(), substr() and strsplit(). I discuss a couple of them here.
Suppose I wish to create five files, q1.pdf through q5.pdf, consisting of histograms of 100 random N(0,i^{2}) variates. I could execute the code
for (i in 1:5) {
   fname <- paste("q",i)
   pdf(fname)
   hist(rnorm(100,sd=i))
   dev.off()
}
The paste() function concatenates the string "q" with the string form of i. For example, when i = 2, the variable fname will be "q 2".
But that wouldn't quite work, as it would give me filenames like "q 2.pdf". On Unix systems, filenames with embedded spaces create headaches.
Instead, I could use sprintf(), which works like the C-language function of the same name:
for (i in 1:5) {
   fname <- sprintf("q%d.pdf",i)
   pdf(fname)
   hist(rnorm(100,sd=i))
   dev.off()
}
Since even many C programmers are unaware of the sprintf() function, some explanation is needed. This function works just like printf(), except that it "prints" to a string, not to the screen. Here we are "printing" to the string fname. What are we printing? The format says to first print "q", then print the character version of i, then print ".pdf". When i = 2, for instance, we print "q2.pdf" to fname.
For floating-point quantities, note also the difference between %f and %g formats:
> sprintf("abc%fdef",1.5)
[1] "abc1.500000def"
> sprintf("abc%gdef",1.5)
[1] "abc1.5def"
During a session, you can scroll back and forth in your command history by typing ctrl-p and ctrl-n. You can also use the history() command to list your more recent commands (which will be run through the pager). Set the named argument max.show=Inf if you want to see all of them.
This displays material one screen at a time. It is automatically invoked by some R commands, such as help(), and you can invoke it yourself on lengthy output. For instance, if you have a long vector x and wish to display it one screen at a time, then instead of typing
> x
type
> page(x)
Hit q to quit the pager, type "/abc" to search for the string "abc" in the pager output, etc.
Sometimes it's preferable to automate the process of running R. For example, we may wish to run an R script that generates a graph output file, and not have to bother with manually running R. Here's how it could be done. Consider the file z.r:
pdf("xh.pdf")     # set graphical output file
hist(rnorm(100))  # generate 100 N(0,1) variates and plot their histogram
dev.off()         # close the file
We could run it automatically by simply typing
R CMD BATCH --vanilla z.r
The --vanilla option tells R not to load any startup file information, and not to save any.
If you are not sure which of several approaches to use to get the fastest R code, you can time them with the function system.time().
The sample() function chooses random items from a set of objects, with or without replacement. For example, here our objects are matrix rows:
> x
     [,1] [,2]
[1,]   11   12
[2,]   21   22
[3,]   31   32
> x[sample(c(1,2,3),2), ]
     [,1] [,2]
[1,]   31   32
[2,]   11   12
> x[sample(c(1,2,3),2), ]
     [,1] [,2]
[1,]   21   22
[2,]   11   12
> x[sample(c(1,2,3),2), ]
     [,1] [,2]
[1,]   31   32
[2,]   11   12
> x[sample(c(1,2,3),2), ]
     [,1] [,2]
[1,]   11   12
[2,]   21   22
Look carefully at what is happening here. The call
sample(c(1,2,3),2)
tells R to take a sample of size 2 without replacement (the default) from c(1,2,3). The first time we did this, the result of that call was the vector (3,1). So,
x[sample(c(1,2,3),2), ]
was x[c(3,1), ], i.e. the 2x2 matrix consisting of the third row and then the first row of x.
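Sampling with replacement works similarly; a quick sketch (set.seed() is used only to make the draw reproducible):

```r
# with replacement, repeats may occur, and the sample size
# may exceed the number of items in the set
set.seed(8888)
s <- sample(c(1,2,3), 5, replace=TRUE)
s
```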
The bootstrap is a resampling method for performing statistical inference in analytically intractable situations. If we have an estimator but no standard error, we get one by resampling from our sample data many times, calculating the estimator each time, and then taking the standard deviation of all those generated values. You could just use sample() for this, but R has a package, boot, which automates the procedure for you.
To use the package, you must first load it:
> library(boot)
Inside that package is a function, boot(), which will do the work of bootstrapping.
Suppose we have a data vector y of length 100, from which we wish to estimate the population median and obtain a standard error for that estimate. We could do the following.
First we must define a function which defines the statistic we wish to compute, which in this case is the median. This function will be called by boot() (it is named statistic() in the list of boot()'s formal parameters). We could define it as follows:
> mdn <- function(x,i) {
+    return(median(x[i]))
+ }
It may seem that all this does is call R's own median() function, and you may wonder why we need to define a new function at all. But it is definitely needed, the key being that second parameter, i. When we call boot(), the latter will generate 100 indices, sampled randomly with replacement from {1,...,100} (recall that 100 is our sample size here). R will then set i to this set of random indices when it calls mdn(). For example, R might generate i to be the 100-element vector (3,22,59,3,14,6,...). Of course, in boot()'s call to mdn(), the formal parameter x is our vector y here. So the expression x[i] means y[c(3,22,59,3,14,6,...)], in other words the vector (y[3],y[22],y[59],y[3],y[14],y[6],...)--exactly the kind of resampling the bootstrap is supposed to do.
To then call boot(), we do something like
> b <- boot(y,mdn,R=200)
This tells R to apply the bootstrap to the data y, calculating the statistic mdn() on each of 200 resamplings of y. The output looks like this:
> b

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = y, statistic = mdn, R = 200)

Bootstrap Statistics :
    original  bias    std. error
t1*        5  0.2375    1.708637
Normally, we would assign the result of boot() to an object, as we did with b above. Among the components of that object are b$t, which is a matrix whose i-th row gives the value of the statistic as found on the i-th bootstrap resampling, and b$t0, which is the value of our statistic on the original data set.
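Here is a small self-contained run illustrating those components; the data, seed and sample size are simulated purely for illustration:

```r
library(boot)
set.seed(1)
y <- rnorm(100, mean=5)  # toy data, just for illustration
# statistic function, as described above
mdn <- function(x,i) {
   return(median(x[i]))
}
b <- boot(y, mdn, R=200)
b$t0          # the median of the original sample
sd(b$t[,1])   # the bootstrap standard error, computed "by hand"
```

Note that sd(b$t[,1]) recomputes, from the raw resampled values, the same standard error that the printed boot() summary reports.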
A somewhat more sophisticated example (they can become quite complex) would be that in which our data is a data frame, say d consisting of 100 rows of two columns. We might be doing regression of the first column against the second. An index i here would tell boot() that the basic datum is a row, d[i,]. We could set our statistic() function to
dolm <- function(fulldata,i) {
   bootdata <- fulldata[i,]
   lmout <- lm(bootdata[,1] ~ bootdata[,2])
   return(lmout$coef)
}
(I've put in some extra steps for clarity.)
Again, when dolm() is called, i will be a set of indices randomly generated by boot(), and fulldata will be our original data set d. The expression fulldata[i,] then gives us the randomly generated set of rows of d, based on the randomly generated indices i.
Our call to boot() could then be
> boot(d,dolm,R=500)
Note that this example also illustrates the fact that boot() can indeed handle vector-valued statistics.
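Putting the pieces together, a self-contained sketch of the regression example might look like this (the simulated data frame d and its column names are my own invention, just for illustration):

```r
library(boot)
set.seed(2)
# simulated data: column u roughly linear in column v
v <- rnorm(100)
d <- data.frame(u = 2 + 3*v + 0.5*rnorm(100), v = v)
# statistic function: regress column 1 on column 2 of the resampled rows
dolm <- function(fulldata,i) {
   bootdata <- fulldata[i,]
   lmout <- lm(bootdata[,1] ~ bootdata[,2])
   return(lmout$coef)
}
bl <- boot(d, dolm, R=500)
dim(bl$t)  # 500 rows, one column per regression coefficient
```

Since dolm() returns two coefficients, bl$t has two columns, one per coefficient; this is what is meant by boot() handling vector-valued statistics.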
Further details are described in an article by the author of boot() in the December 2002 issue of R News.
I must confess to being a non-spreadsheet user. I use R instead! But I often work with government databases, and they tend to release these as either straight Excel files, or as .csv files. (In Excel, one can export to this format.) R is capable of reading .csv files. Here's how:
For concreteness, suppose our file is named d.csv. The first record of the .csv file contains the names of the variables. If a name contains a blank, the name must be quoted; that will already have been taken care of by the spreadsheet or whatever program generated the file in the first place, but I mention it because you will need to use the quotes in your R commands. Say for instance the file has variables x, y and "abc def".
To read your dataset into your R space, issue the command
d <- read.csv("d.csv")
where of course my choice of d as the name for my R table was arbitrary. Then work with R as usual. For example, say we want to see the means of x and "abc def". We type
mean(d$x)
mean(d$"abc def")
Another possible issue is that the data may all be strings, so that you have to convert to numeric values. Use the as.numeric() function for this purpose, e.g.
d$xn <- as.numeric(d$x)
This would create a new variable, d$xn, consisting of the numbers converted from d$x.
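One caution: if R has read a column in as a factor rather than as a character string, as.numeric() returns the internal factor codes, not the original numbers. Converting to character first avoids this; a hypothetical example:

```r
f <- factor(c("10","20","30"))
as.numeric(f)                # gives the factor codes 1 2 3 -- not what we want
as.numeric(as.character(f))  # gives 10 20 30
```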
If you have a .dbf file, the easiest thing to do is run the dbf2txt program, available at http://www.usf.uni-osnabrueck.de/~fkoorman/software/dbftools.en.html , to convert the file to plain ASCII text.
If you want an introductory overview of R, type
> help.start()
R will then invoke a default Web browser. If you want another browser used, say Lynx, type
> help.start(browser="lynx")
For specific online help, invoke help(). For example, type
> help(sd)
or
?sd
to get information on the sd() function.
The example() function will actually run the examples shown in the output of help(). This can be especially useful for graphics functions.
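For instance, running the following will print and execute the examples from sd()'s help page:

```r
# run the examples shown in help(sd)
example(sd)
```

For a graphics function such as hist(), example() actually draws the sample plots, which is a quick way to see what the function can do.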
You can use the function help.search() to do a "Google"-style search through R's documentation in order to determine which function will play a desired role. For instance, in the section titled "Packages (Libraries)" above, we needed a function to generate random variates from multivariate normal distributions. To determine what function, if any, does this, we could type
> help.search("multivariate normal")
getting a response which contains this excerpt:
mvrnorm(MASS) Simulate from a Multivariate Normal Distribution
This tells us that the function mvrnorm() will do the job, and it is in the package MASS.
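We could then do something like the following; the sample size, mean vector and covariance matrix here are arbitrary choices for illustration:

```r
library(MASS)
set.seed(3)  # only so the run is reproducible
# 5 draws from a bivariate normal with mean (0,0) and identity covariance
x <- mvrnorm(5, mu = c(0,0), Sigma = diag(2))
x  # a 5x2 matrix, one row per draw
```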
To "browse," go to the place in your R directory tree where the base is stored. For Linux, for instance, that is likely /usr/lib/R/library/base or some similar location. The file CONTENTS in that directory gives brief descriptions of each entity.
For more information on graphics, the command ?par will give you documentation on the various graphics parameters, but it is rather overwhelming. There are entire books on R/S+ graphics. But you can learn a lot from tutorials on the Web. See below for links to some of them.
There are a number of R tutorials and manuals which have been written, many of which are linked to on the R Web page (click on Manuals | Contributed Documentation). Here are some that I find useful:
Various R search engines are listed on the R home page; click on Search. One of them, RSeek, is the one I use.
If you have Fedora Linux, life is really simple. Just type
$ yum install R
A similar command will probably work in Ubuntu Linux, etc.
There are precompiled binaries for Windows, Linux and MacOS X at the R home page. Just click on Download (CRAN).
On Linux machines, you can compile the source yourself by using the usual
configure
make
make install
sequence. If you want to install to a nonstandard directory, say /a/b/c, run configure as
configure --prefix=/a/b/c
There is a very popular package which allows one to run R (and some other statistical packages) from within emacs, ESS. I personally do not use it, but it clearly has some powerful features for those who wish to put in a bit of time to learn the package. As described in the R FAQ, ESS offers R users:
R support contains code for editing R source code (syntactic indentation and highlighting of source code, partial evaluations of code, loading and error-checking of code, and source code revision maintenance) and documentation (syntactic indentation and highlighting, sending examples to a running ESS process, and previewing); for interacting with an inferior R process from within Emacs (command-line editing, searchable command history, command-line completion of R object and file names, quick access to object and search lists, transcript recording, and an interface to the help system); and for transcript manipulation (recording and saving transcript files, manipulating and editing saved transcripts, and re-evaluating commands from transcript files).
R functions can be called from Python, and vice versa, using the RPy package. It is extremely easy to set up and use.
Note LIBRARY: This is one of the very few differences between R and S. In S, packages are called libraries, and many of the functions which deal with them are different from those in R.