# Numerically stable methods for computing the sample mean and variance.

c-plus-plus floating-point numerical-analysis

This post contains examples of numerically unstable and stable methods for computing the sample mean and variance, with some references.

First, some references:

• B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419–420, Aug. 1962, doi: 10.1080/00401706.1962.10490022.
• T. F. Chan, G. H. Golub, and R. J. LeVeque, “Algorithms for Computing the Sample Variance: Analysis and Recommendations,” The American Statistician, vol. 37, no. 3, pp. 242–247, Aug. 1983, doi: 10.1080/00031305.1983.10483115.

I was writing a function to compute the sample mean and variance … and realized the way I had done it was not numerically stable. To keep this short, I’ll describe a couple of methods. A more in-depth discussion of different methods and their stability is in Chan et al. 1983.

## Standard two-pass algorithm

First, $$S$$ is the sum of squared deviations from the mean, $$\mu$$ is the sample mean, and there are $$N$$ data points, represented by $$x_i$$.

$S = \sum_{i = 1}^N (x_i - \mu)^2 \label{twopasssumsquares}$

Then the variance is $$\frac{S}{N}$$ (population variance) or $$\frac{S}{N - 1}$$ (unbiased sample estimate), depending on the context.

Within this algorithm, first the mean $$\mu$$ must be computed, and there are some choices in how to arrange parentheses, giving different results.

## Naïve method

One option, the naïve one, is

$\mu=\frac{1}{N} \left( \sum_{i=1}^N x_i \right) \label{naivemean}$

in other words, add $$N$$ times, then divide.

## Running average

The second option is the running average, which divides and adds $$N$$ times:

$\mu= \sum_{i=1}^N \left( \frac{x_i}{N} \right) \label{runningaveragemean}$

## Numerical instability of the above

The versions above both have numerical stability problems. One is that in Eq. \eqref{naivemean}, the possibility exists that we are adding a small number to a large number, losing precision as we go. Another potential problem is overflow of $$\sum_{i=1}^N x_i$$.

The running average technique in Eq. \eqref{runningaveragemean} also has stability problems: the individual terms $$\frac{x_i}{N}$$ might be very small, and it shares the problem of adding a small number to a large accumulated sum and losing precision.

## Numerically stable one-pass method from Welford

Welford 1962 provides an updating algorithm:

$\mu_n = \mu_{n-1} + \frac{1}{n}(x_n - \mu_{n-1}) \label{muupdates}$

$S_n = S_{n-1} + \frac{n-1}{n}(x_n - \mu_{n-1})^2 \label{sumsquaresupdates}$

Here, $$n$$ is the indexing variable. The initial conditions are $$\mu_1 = x_1$$ and $$S_1 = 0$$, and the updates run for $$n = 2, \dots, N$$. Finally, $$S_N$$ is divided by $$N$$, or $$N-1$$, depending on the context.

A quickly-written C++ function for performing the Welford algorithm is below. Please do not call this with an integer type!

```cpp
#include <utility>

template <class T>
std::pair<T, T> ReturnMeanVarRecursive(const T* x, int n) {
    // Initial conditions: the mean of the first point is the point
    // itself, and its sum of squared deviations is zero.
    T mean = x[0];
    T sumSquares = T(0);

    // nIndex counts points seen so far (as a floating type, so the
    // divisions below are not integer divisions).
    int xIndex = 1;
    for (T nIndex = 2, nMinusOne = 1; xIndex < n;
         nIndex += T(1), nMinusOne += T(1), ++xIndex) {
        T delta = x[xIndex] - mean;
        sumSquares += (nMinusOne / nIndex) * delta * delta;
        mean += delta / nIndex;  // Eq. \eqref{muupdates}
    }

    // Population variance; divide by n - 1 instead for the sample estimate.
    T var = sumSquares / T(n);

    return std::pair<T, T>(mean, var);
}
```




© Amy Tabb 2018 - 2023. All rights reserved. The contents of this site reflect my personal perspectives and not those of any other entity.