
\documentclass[11pt]{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{0.0in}
% \setlength{\headheight}{0in}
% \setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.0in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.15in}

\usepackage{times}
\usepackage{url}

\usepackage{listings}

\usepackage{graphicx}

\begin{document}

\title{Introduction to ggplot2}

\author{N. Matloff}

\date{November 26, 2011}

\maketitle

\section{Introduction}

Hadley Wickham's {\bf ggplot2} package is a very popular alternative to
R's base graphics package.  (Others include {\bf lattice}, {\bf ggobi}
and so on.)  

The {\bf ggplot2} package is an implementation of the ideas in the book,
{\it The Grammar of Graphics}, by Leland Wilkinson, whose goal was to lay
out a set of general unifying principles for the visualization of data.
For this reason, {\bf ggplot2} offers a more elegant and arguably more
natural approach than does the base R graphics package.

The package has a fairly small number of primitive functions, making
it relatively easy to master.  But by combining these functions in
various ways, one can produce a very large number of types of graphs.

The package is considered especially good at setting reasonable default
values of parameters, and much is done without the user's asking.
Legends are automatically added to graphs, for instance.

The package is quite extensive (only a few functions, but lots of
options), and thus this document is merely a brief introduction.

\section{Installation and Use}

Download and install {\bf ggplot2} with the usual {\bf
install.packages()} function, and then at each usage, load via {\bf
library()}.  Here's what I did on my netbook:

\begin{lstlisting}
# did once:
> install.packages("ggplot2","/home/nm/R")
# do each time I use the package (or use .Rprofile)
> .libPaths("/home/nm/R")
> library(ggplot2)
\end{lstlisting}

\section{Basic Structures}

One operates in the following pattern:

\begin{itemize}

\item One begins with a call to {\bf ggplot()}, which creates an R S3
object of class {\bf "ggplot"},\footnote{The {\bf ggplot2} package also
offers a ``quick and dirty'' wrapper {\bf qplot()} (``quick plot''), but
this does not allow one to make full usage of the package's
capabilities.} with a call

\begin{lstlisting}
> p <- ggplot(yourdataframe)
\end{lstlisting}

or 

\begin{lstlisting}
> p <- ggplot(yourdataframe,aes(yourargs))
\end{lstlisting}

The resulting object {\bf p} consists of a component named {\bf data},
and other components containing information about the plot.  Note that
at this point, though, there is no plot, and nothing is displayed.

\item One adds features to the plot via the + operator, which of course
is an overloaded version of R's built-in +, internally the function
\lstinline{"+.ggplot"}.  One can do this repeatedly, e.g. to superimpose 
several curves on the same graph.

\item The function {\bf aes()} (``aesthetics'') is used to map
your data variables to graph attributes.  This could be, for example, to
specify which variable goes on the X-axis and which one on the Y-axis in
a scatterplot, or for instance to indicate which variable will determine
the color of any given point in the graph.

One can set attributes in this way at various levels:

   \begin{itemize}

   \item We can set attributes for the entire graph by calling {\bf
   aes()} within our call to {\bf ggplot()}, e.g. to specify which
   variables from our dataset we wish to plot.

   \item We can set attributes specific to one of the + actions,
   such as specifying the color when we add a line to the picture.

   \end{itemize}

So for instance we could use {\bf aes()} to specify our data variables
either when we call {\bf ggplot()}, so our data will apply to all
operations, or when we call, say, {\bf geom\_point()}, to indicate data
variables for this specific operation.

\end{itemize}
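
Here is a minimal sketch of the pattern above.  It uses R's built-in
{\bf mtcars} data frame, chosen here only for illustration; it is not
part of the census example below.  The first call to {\bf aes()} sets
attributes for the entire graph, the second for a single layer.

\begin{lstlisting}
library(ggplot2)

# plot-level aes(): wt and mpg apply to every layer added later
p <- ggplot(mtcars, aes(x = wt, y = mpg))
class(p)   # inspect the class; nothing has been drawn yet

# layer-level aes(): the color mapping applies to this layer only
p2 <- p + geom_point(aes(color = factor(cyl)))

# the graph is drawn only when the object is printed
print(p2)
\end{lstlisting}

Typing just {\bf p2} at the R prompt has the same effect as the explicit
{\bf print()} call.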

There are various types of objects that can be used as the second
operand for the +.  Examples are:

\begin{itemize}

\item {\bf geoms} (``geometrics''):  Geometric objects to be drawn, such
as points, lines, bars, filled polygons and text.

\item {\bf position adjustments}:  For instance, in a bar graph, this
controls whether bars should be side by side, or stacked on top of each
other.

\item {\bf facets}:  Specifications to draw many graphs together, as
panels in a large graph.  You can have rows of panels, columns of
panels, and rows and columns of panels!

\item {\bf themes}:  Don't like the gray background in a graph?  Want
nicer labeling, etc.?  You can set each of these individually, but one
of the built-in themes, or a user-contributed one, may prove to be to
your liking, or you can write one that you anticipate using a lot.

\end{itemize}
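
As a rough sketch of how these operand types combine, the following
again uses the built-in {\bf mtcars} data (my choice for illustration,
not the census data).  The particular choices of {\bf "dodge"}, {\bf
facet\_wrap()} and {\bf theme\_bw()} are just one possibility among many.

\begin{lstlisting}
library(ggplot2)

# treat the cylinder count and transmission type as categorical
mt <- transform(mtcars, cyl = factor(cyl), am = factor(am))

p <- ggplot(mt, aes(x = cyl, fill = am))
g <- p +
  geom_bar(position = "dodge") +  # geom, with bars placed side by side
  facet_wrap(~ gear) +            # facet: one panel per value of gear
  theme_bw()                      # theme: white background, not gray
print(g)
\end{lstlisting}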

\section{Example:  Census Data}

The data set here consists of programmers (software engineers, etc.) and
electrical engineers in Silicon Valley, in the 2000 Census.  I've
removed those with less than a Bachelor's degree.  The R object was a
data frame named {\bf pm}.

I first ran

\begin{lstlisting}
p <- ggplot(pm)
\end{lstlisting}

to set up the {\bf ggplot} object.  Next, I typed

\begin{lstlisting}
p + geom_histogram(aes(HiDeg))
\end{lstlisting}

which produced a histogram of the highest-degree values of the workers:

\includegraphics[bb=0 0 504 504,width=3.5in]{edhist.pdf}

Note that the + operation yields a new object of class {\bf "ggplot"}.
Since the print function for that class actually plots the graph, the
graph did appear on the screen.  I could have saved the new object in a
variable if needed.
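
Saving the result in a variable can be sketched as follows.  The data
frame {\bf toy} is made up, standing in for {\bf pm}.  One caveat:
recent releases of {\bf ggplot2} require a continuous variable in {\bf
geom\_histogram()}, so for a character variable such as {\bf HiDeg} the
sketch uses {\bf geom\_bar()}, which counts the cases in each category.

\begin{lstlisting}
library(ggplot2)

# made-up stand-in for pm; the values are illustrative only
toy <- data.frame(HiDeg = c("BS", "BS", "MS", "MS", "MS", "PhD"))

p <- ggplot(toy)
h <- p + geom_bar(aes(HiDeg))   # saved in h, so nothing is drawn yet
print(h)                        # the drawing happens here
\end{lstlisting}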

I then decided to do a scatter plot of salary versus age:

\begin{lstlisting}
> p + geom_point(aes(x=Age,y=WgInc))
\end{lstlisting}

Note the role of {\bf aes()} here; I used it to tell {\bf geom\_point()}
which of my data variables would correspond to the X- and Y-axes.

This gave me this graph:

\includegraphics[bb=0 0 504 504,width=3.5in]{ageinc.pdf}

(As is often the case with large data sets, the points tend to ``fill
in'' entire regions.  One solution is to graph a random subset of the
data, which I did not do here.)
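
The random-subset idea could be sketched as below.  The data frame {\bf
toy} is simulated, with columns named as in {\bf pm} but with made-up
values, and the 10\% sampling fraction is an arbitrary choice.

\begin{lstlisting}
library(ggplot2)

set.seed(1)   # make the random draw reproducible
# simulated stand-in for pm; the values are made up
toy <- data.frame(Age = runif(5000, 25, 65),
                  WgInc = rexp(5000, 1 / 60000))

# keep a random 10% of the rows, then plot only those
idx <- sample(nrow(toy), 500)
sub <- toy[idx, ]
g <- ggplot(sub) + geom_point(aes(x = Age, y = WgInc))
print(g)
\end{lstlisting}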

However, I wanted to separate the points according to highest degree
level:

\begin{lstlisting}
> p + geom_point(aes(x=Age,y=WgInc,color=HiDeg))
\end{lstlisting}

Here I have three data variables informing {\bf aes()}:  Age, wage
income and highest degree.  The argument {\bf color} here means that I
want the degree to be used for color coding the points:\footnote{Note
that if I had wanted the same color to be set for {\it all} points, I'd
set the {\bf color} option \underline{outside} the {\bf aes()} call, as
a separate argument to {\bf geom\_point()}.}

\includegraphics[bb=0 0 504 504,width=3.5in]{ageinccolor.pdf}

Note the legend that was automatically included on the right.
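
By contrast, here is a small sketch of setting one fixed color for all
points, with {\bf color} placed outside {\bf aes()}.  The data frame
{\bf toy} and the choice of blue are made up for illustration.

\begin{lstlisting}
library(ggplot2)

# made-up stand-in for pm
toy <- data.frame(Age = c(30, 45, 52),
                  WgInc = c(50000, 80000, 65000))

# color is outside aes(), so it is a fixed setting, not a mapping
# from a data variable; no legend is produced for it
g <- ggplot(toy) +
  geom_point(aes(x = Age, y = WgInc), color = "blue")
print(g)
\end{lstlisting}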

Since some people might be viewing a black-and-white version of this
document, I ran the command again, specifying point shape instead of
point color:

\begin{lstlisting}
p + geom_point(aes(x=Age,y=WgInc,shape=HiDeg))
\end{lstlisting}

Here {\bf ggplot2} decided to use a circle, a triangle and a square to
represent Bachelor's, Master's and PhD workers:

\includegraphics[bb=0 0 504 504,width=4.5in]{ageincshape.pdf}

Finally, since I'm interested in age discrimination in the industry, I
decided to restrict my graph to those over age 40.  The {\bf ggplot2}
package lets me replace the data frame in an existing plot object, here
with the output of the R {\bf subset()} function, by writing

\begin{lstlisting}
p %+% subset(pm,Age > 40) + geom_point(aes(x=Age,y=WgInc,color=HiDeg))
\end{lstlisting}

The new operator {\bf \%+\%} is mapped to {\bf "+.ggplot"}; given a data
frame as its right operand, it replaces the data in {\bf p}.  The result
was

\includegraphics[bb=0 0 504 504,width=4.5in]{40ageinccolor.pdf}

Look at all those 0-income PhDs!  (There was also a business income
variable, which I did not pursue here, so maybe some had high incomes
after all.)
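
For readers without the census data, the same data-replacement step can
be tried on a tiny made-up data frame.  Here {\bf toy} is hypothetical,
with columns named as in {\bf pm}.

\begin{lstlisting}
library(ggplot2)

# made-up stand-in for pm
toy <- data.frame(Age = c(30, 38, 45, 52, 61),
                  WgInc = c(50000, 72000, 80000, 0, 65000),
                  HiDeg = c("BS", "MS", "MS", "PhD", "BS"))

p <- ggplot(toy)
# %+% binds more tightly than +, so the data is replaced first
# and the point layer is then added to the result
g <- p %+% subset(toy, Age > 40) +
  geom_point(aes(x = Age, y = WgInc, color = HiDeg))
print(g)
\end{lstlisting}

Only the three rows with {\bf Age} above 40 are plotted; the original
object {\bf p} is left unchanged.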

\section{For Further Information}

Just plugging ``ggplot2 tutorial,'' ``ggplot2 introduction,'' ``ggplot2
examples'' and so on into your favorite search engine will give you tons
of information.

Hadley's book, {\it ggplot2: Elegant Graphics for Data Analysis}, is of
course the definitive source, but also try his pictorial reference
manual, at \url{http://had.co.nz/ggplot2/}.

\end{document}

pm <- pums[pums$WgInc < 300000,]
pm$Mas <- pm$Educ == 14
pm$Bach <- pm$Educ == 13
pm$PhD <- pm$Educ == 16
pm$EdCode <- ifelse(pm$Bach,1,ifelse(pm$Mas,2,3))
pm$HiDeg <- ifelse(pm$EdCode == 1,"BS",ifelse(pm$EdCode == 2,"MS","PhD"))


