Norm Matloff's Big Data Visualization Tools
(Here I define Big Data rather generally. Any data set that is large
enough to fill major portions of the screen when the points are
plotted counts as Big from my point of view.)
I've developed a new graphical package for R, BDGraphs, which you can
download here, or as individual files,
here.
(In order to avoid confusion with an unrelated package BDGraph on CRAN,
I will soon change the name of my package to BigNGraphs.)
It consists of some novel tools for
visualization of large data sets. They are computationally intensive, but use parallel processing to
reduce run time.
Here are the main tools (best understood by clicking on the
Examples link):
- freqparcoord(): A novel approach to
parallel coordinates. The big problem with parallel coordinates is
screen clutter. The approach here addresses that problem by plotting
only "typical" lines, defined as those that have the highest
estimated multivariate density.
- boundary(): Group comparison via comparison of
regression boundary curves. Nonparametrically estimates the
regression function of Z on X and Y, and plots the boundary curve
separating the points (X,Y) having higher-than-average conditional mean Z
from the points having lower-than-average values. By plotting one
curve for each defined subgroup, this function enables exploration of
the interaction of the groups and X, Y and Z.
- ratioest(): Comparison of two groups via the ratio
of their nonparametrically estimated regression functions of Z on
X and Y.
- resid(): Classical regression model residual
analysis, but with a new twist. By comparing a parametric fit to a
nonparametric estimate of the regression of Z on X and Y, it
identifies the most frequent positive and negative residuals at each
(X,Y) point, and thus regions of over- or underfitting.
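To make the "typical lines" idea behind freqparcoord() concrete, here is a minimal sketch of selecting the rows with the highest estimated multivariate density. It is written in Python purely as an illustration and is not the package's code or API. The density estimator (inverse distance to the k-th nearest neighbor), the function names knn_density_scores() and most_typical(), and the synthetic data are all assumptions made for this example.

```python
import math
import random

def knn_density_scores(rows, k):
    """Score each row by the inverse of its distance to its k-th nearest neighbor.

    A small k-th-neighbor distance means a dense neighborhood, so a larger
    score marks a more "typical" row. Brute force, O(n^2): demo use only.
    """
    scores = []
    for i, p in enumerate(rows):
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        scores.append(1.0 / (dists[k - 1] + 1e-12))
    return scores

def most_typical(rows, k, m):
    """Return the indices of the m rows with the highest density score."""
    scores = knn_density_scores(rows, k)
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return order[:m]

random.seed(0)
# Synthetic data: 200 points clustered near the origin, then 5 distant outliers.
cluster = [tuple(random.gauss(0, 1) for _ in range(3)) for _ in range(200)]
outliers = [(10.0 + i, -10.0 - i, 10.0 + 2 * i) for i in range(5)]
data = cluster + outliers

# Only these 20 rows would be drawn as lines in a parallel-coordinates plot.
typical = most_typical(data, k=10, m=20)
```

Plotting only the selected rows, rather than all 205, is what removes the clutter: the outliers and sparse fringe of the cluster never reach the screen.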
All tools here use nonparametric curve estimation methods, which can be
computationally intensive, so the package offers parallel
computation on either multicore machines or clusters.
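As a rough illustration of why this kind of estimation parallelizes well, the sketch below estimates the regression of Z on X and Y with a k-nearest-neighbor average, splitting the evaluation points into chunks that are handled by separate workers. It is Python written for this page, not the package's implementation. The k-NN estimator, the names knn_regress() and parallel_knn_regress(), and the toy grid data are assumptions. Threads are used only to keep the example short; in CPython a process pool or a cluster would be needed for an actual speedup on CPU-bound work.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def knn_regress(train, query, k):
    """Estimate E[Z | X, Y] at each query point as the mean Z of its k nearest training rows."""
    estimates = []
    for qx, qy in query:
        nearest = sorted(train, key=lambda r: math.hypot(r[0] - qx, r[1] - qy))[:k]
        estimates.append(sum(r[2] for r in nearest) / k)
    return estimates

def parallel_knn_regress(train, query, k, workers):
    """Split the query points into contiguous chunks and estimate each chunk in its own worker."""
    size = max(1, math.ceil(len(query) / workers))
    chunks = [query[i:i + size] for i in range(0, len(query), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results line up with the query list.
        parts = list(pool.map(lambda c: knn_regress(train, c, k), chunks))
    return [est for part in parts for est in part]

# Toy data: rows (X, Y, Z) on an integer grid with Z = X + Y.
train = [(x, y, x + y) for x in range(10) for y in range(10)]
# Evaluate at the centers of the grid cells.
query = [(x + 0.5, y + 0.5) for x in range(9) for y in range(9)]

serial = knn_regress(train, query, 4)
parallel = parallel_knn_regress(train, query, 4, workers=4)
```

Each query point is estimated independently of the others, so the chunks need no communication, and the parallel result matches the serial one.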
My JSM talk is
here, and the full paper is
here.
Note:
No warranties of any kind are made regarding the software or methodology.