Open Textbook:
From Algorithms to Z-Scores: Probabilistic
and Statistical Modeling in Computer Science
OVERVIEW:
The materials here form a textbook for a course in mathematical
probability and statistics for computer science students.
"Why is this course different from all other courses?"
- Computer science examples are used throughout, in areas such as:
computer networks; data and text mining; computer security; remote
sensing; computer performance evaluation; software engineering; data
management; etc.
- The R statistical/data manipulation language is used throughout.
- Throughout the units, mathematical theory and applications are
interwoven, with a strong emphasis on modeling: What do
probabilistic models really mean, in real-life terms? How does one
choose a model? How do we assess the practical usefulness of models?
- There is considerable discussion of the intuition involving
probabilistic concepts. However, all models and so on are described
precisely in terms of random variables and distributions.
The materials are continuously evolving, with new
examples and topics being added.
Prerequisites: The student must know calculus, basic matrix
algebra, and have skill in programming.
LICENSING:

This
work is licensed under a Creative
Commons Attribution-No Derivative Works 3.0 United States License.
Copyright is retained by N. Matloff in all non-U.S. jurisdictions,
but permission to use these materials in teaching is still granted,
provided the licensing information here is displayed in each unit. I would
appreciate being notified if you use this book for teaching, just so
that I know the materials are being put to use, but this is not required.
CHAPTERS:
-
Discrete probability: Notion of a repeatable experiment; basic
probability computations; introduction to simulation; combinatorics;
random variables; independence; expected value; notion of a
distribution; notion of a parametric family of distributions; geometric,
binomial, Poisson and negative binomial parametric families; a
cautionary tale on independence; examples from computer networks, data
mining.
-
Continous probability: Notion of continuous random variables as a
useful idealization; intuitive aspects of density functions; uniform,
normal, chi-square, exponential and gamma parametric families of
densities; Central Limit Theorem; lifetime models: memoryless property
of exponential distributions, hazard functions; a cautionary tale---the
Bus Paradox; examples from disk systems, computer security, computer
networks, software reliability.
-
Multivariate distributions: Need for the notion of multivariate
distributions; covariance and correlation, independence convolution;
matrix formulation of mean, variance, covariance; conditional
distributions, including conditional expectation and Law of Total
Expectation; multivariate normal distribution and multivariate Central
Limit Theorem; simulation of multivariate distributions; transform
functions; examples in disk systems, hash tables, computer networks,
text mining, data mining, reliability.
-
Random variate generators: Inverse transform method,
rejection/acceptance method, ad hoc methods for specific
parametric families.
-
Statistical sampling and inference: Sampling distributions;
confidence intervals based on the CLT; standard errors of estimators;
the delta method; exact confidence intervals; simultaneous confidence
intervals; bootstrap method; hypothesis testing, p-values; problems with
hypothesis testing; mean squared errors of estimators, bias, variance;
Method of Maximum Likelihood; Method of Moments; Bayesian methods; the
empirical CDF; nonparametric density estimators.
-
Model building: Model building: the problem of overfitting,
chi-square goodness of fit test, Kolmogorov-Smirnov confidence bands.
-
Statistical inference on discrete-event simulation output: Batch
means, regenerative method.
-
Relations---regression, classification, principle components,
contingency tables: Twin goals of prediction and understanding;
regression function as conditional mean and best predictor; linear
regression models and inference; overfitting problem in regression;
nonlinear parametric regression models; nonparametric estimators for
regression functions; regression diagnostics; the classification
problem; logistic parametric model; nonparametric classification
methods---kernel method, CART, SVM; principal components analysis;
log-linear models; examples in computer networks, simulation, data
mining, remote sensing.
- Markov
chains: Discrete-time Markov chains; notion of ergodicity; hidden
Markov models; continuous-time Markov chains; birth-death processes;
hitting times; examples in computer security, multiprocessor systems,
computer networks, text mining, reliability, tree searching.
-
Introduction to queuing models:
M/M/k model; Little's Rule; loss models; nonexponential service times;
reversed Markov chains; networks of queues.
-
Introduction to renewal theory:
Duality between "lifetime domain" and "counts domain"; Poisson
processes; relation to exponential distribution; conditional
distribution of renewal times given count; decomposition and
superposition of Poisson processes; nonhomogeneous Poisson processes;
properties of general renewal processes; alternating renewal processes;
residual life and age distributions; examples in Web page modification
rates, inventory, disk files, discrete event simulation, memory paging,
software reliability.
-
R for Programmers: My manual on R.
Author's Bio: Norm Matloff is a Professor of Computer
Science at the University of California, Davis. He is formerly a
statistics professor at that university, and thus approaches the subject
matter here as both a statistician and computer scientist. His research
has including a number of diverse areas in the two fields, and has been
a recipient of the university's Distinguished Teaching Award.