# Norm Matloff's Big Data Visualization Tools

(I define Big Data rather generally here: any data set large enough to fill major portions of the screen when its points are plotted counts as Big from my point of view.)

I've developed a new graphical package for R, BDGraphs, which you can download here, or as individual files here. (To avoid confusion with the unrelated package BDGraph on CRAN, I will soon rename my package BigNGraphs.) It consists of some novel tools for visualizing large data sets. The tools are computationally intensive, but use parallel processing to greatly reduce run time.

Here are the main tools (best understood by clicking on the Examples link):

• freqparcoord(): A novel approach to parallel coordinates. The big problem with parallel coordinates is screen clutter. The approach here addresses that problem by plotting only "typical" lines, defined as those that have the highest estimated multivariate density.
• boundary(): Group comparison via regression boundary curves. Nonparametrically estimates the regression function of Z on X and Y, and plots the boundary curve separating the points (X,Y) having higher-than-average conditional mean Z from the points having lower-than-average values. By plotting one curve for each defined subgroup, this function enables exploration of the interaction of the groups and X, Y and Z.
• ratioest(): Comparison of two groups via the ratio of their nonparametrically estimated regression functions of Z on X and Y.
• resid(): Classical regression model residual analysis, but with a new twist. It compares a parametric fit to a nonparametric estimate of the regression of Z on X and Y, and identifies the most frequent positive and negative residuals at each (X,Y) point, thereby revealing regions of over/underfitting.
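To make the "typical lines" idea behind freqparcoord() concrete, here is a minimal base-R sketch. It is not the package's code and does not use its API: the helper names `knn_density_score()` and `plot_typical_lines()` are hypothetical, and a k-nearest-neighbor distance is assumed as a stand-in for whatever density estimator the package actually uses. Each row is scored by how close its k-th nearest neighbor is, and only the m highest-scoring rows are drawn as parallel-coordinate lines.

```r
# Hypothetical helper (not part of the package): rank rows by a k-NN
# density score. A smaller distance to the k-th nearest neighbor means a
# higher estimated density, so 1 / distance gives the same ranking.
knn_density_score <- function(x, k = 10) {
  xs <- scale(x)                        # put all variables on a common scale
  d <- as.matrix(dist(xs))              # pairwise Euclidean distances
  # position k + 1 skips the zero distance from each point to itself
  kth <- apply(d, 1, function(row) sort(row)[k + 1])
  1 / kth
}

# Hypothetical helper: draw only the m rows with the highest density score.
plot_typical_lines <- function(x, m = 5, k = 10) {
  score <- knn_density_score(x, k)
  top <- order(score, decreasing = TRUE)[seq_len(m)]
  xs <- scale(x)[top, , drop = FALSE]
  # one line per selected row, one axis position per variable
  matplot(t(xs), type = "l", lty = 1, xaxt = "n",
          xlab = "", ylab = "standardized value")
  axis(1, at = seq_len(ncol(xs)), labels = colnames(x))
  invisible(top)                        # row indices of the plotted lines
}

# Usage on a built-in data set: 5 "typical" cars out of 32
top <- plot_typical_lines(mtcars[, c("mpg", "disp", "hp", "wt")], m = 5)
```

Plotting 5 lines instead of all 32 is the whole point: the clutter goes away, and the lines that remain are the ones sitting in the densest part of the data. The full pairwise distance matrix used here would not scale to truly large n; it is only meant to show the selection step.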

All of the tools here use nonparametric curve estimation methods, which can be computationally intensive, so the package offers parallel computation on either multicore machines or clusters.
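The following sketch shows one way such an estimate can be split across workers, using R's standard `parallel` package. It is an illustration under stated assumptions, not the package's implementation: `knn_reg_at()` and `par_knn_reg()` are hypothetical names, a plain k-nearest-neighbor average is assumed as the nonparametric regression estimator, and the data are simulated. The last lines mimic the idea behind boundary() by marking points whose estimated conditional mean of Z lies above the overall mean.

```r
library(parallel)

# Hypothetical helper: k-NN estimate of E(Z | X, Y) at the rows `idx` of xy.
# Each point counts as one of its own neighbors, which is fine for a sketch.
knn_reg_at <- function(idx, xy, z, k) {
  vapply(idx, function(i) {
    d2 <- (xy[, 1] - xy[i, 1])^2 + (xy[, 2] - xy[i, 2])^2
    mean(z[order(d2)[seq_len(k)]])      # average Z over the k closest points
  }, numeric(1))
}

# Hypothetical helper: give each worker one contiguous chunk of row indices.
par_knn_reg <- function(cl, xy, z, k = 25) {
  chunks <- splitIndices(nrow(xy), length(cl))
  unlist(parLapply(cl, chunks, knn_reg_at, xy = xy, z = z, k = k))
}

# Simulated data (an assumption for illustration): Z rises with X + Y
set.seed(1)
n <- 2000
xy <- cbind(x = runif(n), y = runif(n))
z <- xy[, 1] + xy[, 2] + rnorm(n, sd = 0.1)

cl <- makeCluster(2)                    # two local workers
zhat <- par_knn_reg(cl, xy, z, k = 25)
stopCluster(cl)

# Points with above-average estimated conditional mean Z, boundary()-style
above <- zhat > mean(z)
plot(xy, col = ifelse(above, "red", "blue"), pch = 20, xlab = "X", ylab = "Y")
```

Because the estimate at one point does not depend on the estimate at any other, the work divides cleanly by rows, and the same function runs unchanged whether `cl` refers to cores on one machine or to nodes of a cluster. In the plot, the edge between the red and blue regions approximates the curve on which the regression function equals the overall mean of Z.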

My JSM talk is here, and the full paper is here.

Note: No warranties made of any kind regarding the software or methodology.