Numerically stable methods for computing the sample mean and variance.

c-plus-plus floating-point numerical-analysis

This post will contain some examples of numerically instable and stable methods for computing the sample mean and variance, with some references.

First, some references:

  • B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419–420, Aug. 1962, doi: 10.1080/00401706.1962.10490022. link.
  • T. F. Chan, G. H. Golub, and R. J. Leveque, “Algorithms for Computing the Sample Variance: Analysis and Recommendations,” The American Statistician, vol. 37, no. 3, pp. 242–247, Aug. 1983, doi: 10.1080/00031305.1983.10483115. Published paper link, Open access technical report link.

I was writing a function to compute the sample mean and variance … and realized the way I had done it was not numerically stable. To keep this short, I’ll describe a couple methods. A more in-depth discussion of different methods and their stability is in Chan et al. 1983.

Standard two-pass algorithm

First, \(S\) is sum of squares, \(\mu\) is the sample mean, and there are \(N\) data points, represented by \(x_i\).

\[S = \sum_{i = 1}^N (x_i - \mu)^2 \label{twopasssumsqures}\]

Then the sample variance is \(\frac{S}{N}\) or \(\frac{S}{N - 1}\), depending.

Within this algorithm, first the mean \(\mu\) must be computed, and there are some choices in how to arrange parentheses, giving different results.

Naïve method

One option, the naïve one, is

\[\mu=\frac{1}{N} \left( \sum_{i=1}^N x_i \right) \label{naivemean}\]

in other words, add \(N\) times, then divide.

Running average

The second option is the running average, which divides and adds \(N\) times:

\[\mu= \sum_{i=1}^N \left( \frac{x_i}{N} \right) \label{runningaveragemean}\]

Numerical instability of the above

The versions above both have numerical stability problems. One is that in Eq. \eqref{naivemean}, the possibility exists that we are adding a small number to a large number, losing precision as we go. Another potential problem is overflow of \(\sum_{i=1}^N x_i\).

The Running Average technique in Eq. \eqref{runningaveragemean} also has some stability problems; individual \(\frac{x_i}{N}\) might be very small, as well as the problems with adding a small number to a large number and losing precision.

Numerically stable one-pass method from Welford

Welford 1962 provides an updating algorithm:

\[\mu_n = \mu_{n-1} + (1/n)(x_n - \mu_{n-1}) \label{muupdates}\] \[S_n = S_{n-1} + \frac{n-1}{n}(x_n - \mu_{n-1} )^2 \label{sumsquaresupdates}\]

Here, \(n\) is the indexing variable. Initial conditions are that the first \(\mu_0 = x_0\), and \(S_0 = 0\). Finally, \(S_N\) is divided by \(N\), or \(N-1\), depending on the context.

A quickly-written C++ function for performing the Welford algortihm is below. Please do not call this with an integer type!

template<class T>
pair<T, T> ReturnMeanVarRecursive(const T* x, int n){
    
    T mean = 0.0;
    T sumSquares = 0.0;
    T tOne = T(1);
    T tN = T(n);
    T var = 0;

    // initialize for loop
    mean =  x[0];
    sumSquares = 0; // (x[0] - mean)^2 = (x[0] - x[0])^2 = 0

    int xIndex = 1;
    for (T nIndex = 2, nMinusOne = 1; xIndex < n; nIndex++, nMinusOne++, xIndex++){
        sumSquares = sumSquares + (nMinusOne/nIndex)*pow((x[xIndex] - mean), 2);
        mean = (nMinusOne/nIndex) * mean +  (tOne/nIndex) * x[xIndex];
    }

    var = sumSquares/tN;

    return pair<T, T>(mean, var);
}

Listing of Numerical Analysis posts.

© Amy Tabb 2018 - 2023. All rights reserved. The contents of this site reflect my personal perspectives and not those of any other entity.