Numerically stable methods for computing the sample mean and variance.
15 Jun 2022

This post contains some examples of numerically unstable and stable methods for computing the sample mean and variance, with references.
First, some references:
- B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419–420, Aug. 1962, doi: 10.1080/00401706.1962.10490022.
- T. F. Chan, G. H. Golub, and R. J. LeVeque, “Algorithms for Computing the Sample Variance: Analysis and Recommendations,” The American Statistician, vol. 37, no. 3, pp. 242–247, Aug. 1983, doi: 10.1080/00031305.1983.10483115.
I was writing a function to compute the sample mean and variance … and realized the way I had done it was not numerically stable. To keep this short, I’ll describe a couple methods. A more in-depth discussion of different methods and their stability is in Chan et al. 1983.
Standard two-pass algorithm
First, \(S\) is the sum of squared deviations from the mean, \(\mu\) is the sample mean, and there are \(N\) data points, represented by \(x_i\).
\[S = \sum_{i = 1}^N (x_i - \mu)^2 \label{twopasssumsqures}\]
Then the sample variance is \(\frac{S}{N}\) or \(\frac{S}{N - 1}\), depending on whether the biased or the unbiased estimator is wanted.
Within this algorithm, first the mean \(\mu\) must be computed, and there are some choices in how to arrange parentheses, giving different results.
Naïve method
One option, the naïve one, is
\[\mu=\frac{1}{N} \left( \sum_{i=1}^N x_i \right) \label{naivemean}\]in other words, add \(N\) times, then divide.
Running average
The second option is the running average, which divides and adds \(N\) times:
\[\mu= \sum_{i=1}^N \left( \frac{x_i}{N} \right) \label{runningaveragemean}\]
Numerical instability of the above
Both versions above have numerical stability problems. One is that in Eq. \eqref{naivemean} we may repeatedly add a small number to a large running sum, losing precision as we go. Another is potential overflow of \(\sum_{i=1}^N x_i\).
The running average in Eq. \eqref{runningaveragemean} has its own problems: the individual \(\frac{x_i}{N}\) terms may be very small, and the same issue of adding a small number to a large accumulated sum remains.
Numerically stable one-pass method from Welford
Welford 1962 provides an updating algorithm:
\[\mu_n = \mu_{n-1} + \frac{1}{n}(x_n - \mu_{n-1}) \label{muupdates}\]
\[S_n = S_{n-1} + \frac{n-1}{n}(x_n - \mu_{n-1} )^2 \label{sumsquaresupdates}\]
Here, \(n\) indexes the data points from 1 to \(N\). The initial conditions are \(\mu_1 = x_1\) and \(S_1 = 0\), and the updates run for \(n = 2, \dots, N\). Finally, \(S_N\) is divided by \(N\) or \(N-1\), depending on the context.
A quickly-written C++ function implementing the Welford algorithm is below. Please do not call this with an integer type!
#include <utility>  // std::pair

// Welford's one-pass algorithm. T should be a floating-point type.
template<class T>
std::pair<T, T> ReturnMeanVarRecursive(const T* x, int n){
    // Initialize with the first data point: mean_1 = x[0], S_1 = 0.
    T mean = x[0];
    T sumSquares = T(0);
    for (int i = 1; i < n; i++){
        T nIndex = T(i + 1);   // n in the update equations
        T nMinusOne = T(i);    // n - 1
        T delta = x[i] - mean;
        sumSquares = sumSquares + (nMinusOne/nIndex) * delta * delta;
        mean = mean + delta/nIndex;
    }
    T var = sumSquares/T(n);   // or divide by n - 1, depending on the context
    return std::pair<T, T>(mean, var);
}