Numerically stable methods for computing the sample mean and variance.

15 Jun 2022 c-plus-plus floating-point numerical-analysis

This post will contain some examples of numerically instable and stable methods for computing the sample mean and variance, with some references.

First, some references:

B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419–420, Aug. 1962, doi: 10.1080/00401706.1962.10490022. link.
T. F. Chan, G. H. Golub, and R. J. Leveque, “Algorithms for Computing the Sample Variance: Analysis and Recommendations,” The American Statistician, vol. 37, no. 3, pp. 242–247, Aug. 1983, doi: 10.1080/00031305.1983.10483115. Published paper link, Open access technical report link.

I was writing a function to compute the sample mean and variance … and realized the way I had done it was not numerically stable. To keep this short, I’ll describe a couple methods. A more in-depth discussion of different methods and their stability is in Chan et al. 1983.

Standard two-pass algorithm

First, \(S\) is sum of squares, \(\mu\) is the sample mean, and there are \(N\) data points, represented by \(x_i\).

\[S = \sum_{i = 1}^N (x_i - \mu)^2 \label{twopasssumsqures}\]

Then the sample variance is \(\frac{S}{N}\) or \(\frac{S}{N - 1}\), depending.

Within this algorithm, first the mean \(\mu\) must be computed, and there are some choices in how to arrange parentheses, giving different results.

Naïve method

One option, the naïve one, is

\[\mu=\frac{1}{N} \left( \sum_{i=1}^N x_i \right) \label{naivemean}\]

in other words, add \(N\) times, then divide.

Running average

The second option is the running average, which divides and adds \(N\) times:

\[\mu= \sum_{i=1}^N \left( \frac{x_i}{N} \right) \label{runningaveragemean}\]

Numerical instability of the above

The versions above both have numerical stability problems. One is that in Eq. \eqref{naivemean}, the possibility exists that we are adding a small number to a large number, losing precision as we go. Another potential problem is overflow of \(\sum_{i=1}^N x_i\).

The Running Average technique in Eq. \eqref{runningaveragemean} also has some stability problems; individual \(\frac{x_i}{N}\) might be very small, as well as the problems with adding a small number to a large number and losing precision.

Numerically stable one-pass method from Welford

Welford 1962 provides an updating algorithm:

\[\mu_n = \mu_{n-1} + (1/n)(x_n - \mu_{n-1}) \label{muupdates}\] \[S_n = S_{n-1} + \frac{n-1}{n}(x_n - \mu_{n-1} )^2 \label{sumsquaresupdates}\]

Here, \(n\) is the indexing variable. Initial conditions are that the first \(\mu_0 = x_0\), and \(S_0 = 0\). Finally, \(S_N\) is divided by \(N\), or \(N-1\), depending on the context.

A quickly-written C++ function for performing the Welford algortihm is below. Please do not call this with an integer type!

template<class T>
pair<T, T> ReturnMeanVarRecursive(const T* x, int n){
    
    T mean = 0.0;
    T sumSquares = 0.0;
    T tOne = T(1);
    T tN = T(n);
    T var = 0;

    // initialize for loop
    mean =  x[0];
    sumSquares = 0; // (x[0] - mean)^2 = (x[0] - x[0])^2 = 0

    int xIndex = 1;
    for (T nIndex = 2, nMinusOne = 1; xIndex < n; nIndex++, nMinusOne++, xIndex++){
        sumSquares = sumSquares + (nMinusOne/nIndex)*pow((x[xIndex] - mean), 2);
        mean = (nMinusOne/nIndex) * mean +  (tOne/nIndex) * x[xIndex];
    }

    var = sumSquares/tN;

    return pair<T, T>(mean, var);
}

Listing of Numerical Analysis posts.

Share or discuss.

Share on Twitter. Share on Facebook. Share on LinkedIn.

Amy Tabb

Numerically stable methods for computing the sample mean and variance.

Standard two-pass algorithm

Naïve method

Running average

Numerical instability of the above

Numerically stable one-pass method from Welford

Share or discuss.

Related Posts

My favorite print debugging method in C++. 03 Jul 2023

CMake: conditional and logical expression syntax. 06 Dec 2022

CMake: How to force reconfiguration. 29 Nov 2022

CMake: How to specify the compiler in CMakeLists.txt. 28 Nov 2022

Mixing C and C++ in one file. 27 Nov 2022