## What is the difference between N and N-1 in calculating population variance?

I did not get why there are `N` and `N-1` while calculating population variance. When do we use `N` and when do we use `N-1`?


It says that when the population is very big there is no difference between N and N-1, but it does not explain why there is N-1 in the first place.

Edit: Please don’t confuse this with `n` and `n-1`, which are used in estimation.

Edit 2: I’m not talking about population estimation.

$N$ is the population size and $n$ is the sample size. The question asks why the population variance is the mean squared deviation from the mean rather than $(N-1)/N = 1-(1/N)$ times it. For that matter, why stop there? Why not multiply the mean squared deviation by $1-2/N$, or $1-17/N$, or $\exp(-1/N)$, for instance?
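Written out, the two competing formulas are the standard ones:

$$\sigma^2 \;=\; \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2, \qquad s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2,$$

and the question is why the left-hand (population) formula carries no $-1$ of its own.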

There actually is a good reason not to. Any of these figures I just mentioned would serve just fine as a way to quantify a “typical spread” within the population. However, without prior knowledge of the population size, it would be impossible to use a random sample to find an unbiased estimator of such a figure. We know that the *sample variance*, which multiplies the mean squared deviation from the sample mean by $n/(n-1)$, is an unbiased estimator of the usual population variance when sampling with replacement. (There is no problem with making this correction, because we know $n$!) The sample variance would therefore be a *biased* estimator of any multiple of the population variance where that multiple, such as $1-1/N$, is not exactly known beforehand.

This problem of some unknown amount of bias would propagate to all statistical tests that use the sample variance, including t-tests and F-tests. In effect, dividing by anything other than $N$ in the population variance formula would require us to change all statistical tabulations of t-statistics and F-statistics (and many other tables as well), but the adjustment would depend on the population size. Nobody wants to have to make tables for every possible $N$! Especially when it’s not necessary.

As a practical matter, when $N$ is small enough that using $N-1$ instead of $N$ in formulas makes a difference, you usually *do* know the population size (or can guess it accurately), and you would likely resort to much more substantial small-population corrections when working with random samples (without replacement) from the population. In all other cases, who cares? The difference doesn’t matter. For these reasons, guided by pedagogical considerations (namely, of focusing on details that matter and glossing over details that don’t), some excellent introductory statistics texts don’t even bother to teach the difference: they simply provide a single variance formula (divide by $N$ or $n$ as the case may be).

There has, in the past, been an argument that you should use N for a non-inferential variance, but I wouldn’t recommend that anymore. You should always use N-1. As sample size decreases, N-1 is a pretty good correction for the fact that the sample variance gets lower (you’re just more likely to sample near the peak of the distribution; see figure). If the sample size is really big then it doesn’t matter any meaningful amount.

An alternative explanation is that the population is a theoretical construct that’s impossible to achieve. Therefore, always use N-1 because whatever you’re doing you’re, at best, estimating the population variance.

Also, you’re going to be seeing N-1 for variance estimates from here on in. You’ll likely not ever encounter this issue… except on a test, when your teacher might ask you to make a distinction between an inferential and non-inferential variance measure. In that case, don’t use whuber’s answer or mine; refer to ttnphns’s answer.

Note: in this figure the variance should be close to 1. Look how much it varies with sample size when you use N to estimate the variance. (This is the “bias” referred to elsewhere.)
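The bias shown in the figure is easy to reproduce numerically. Here is a minimal sketch using only Python’s standard library; the Gaussian population, the sample size of 3, and the trial count are my own illustrative choices, not values taken from the original figure:

```python
import random
import statistics

random.seed(42)
# An illustrative "population" with variance close to 1, as in the figure.
population = [random.gauss(0, 1) for _ in range(10_000)]
true_var = statistics.pvariance(population)   # divisor N over the whole population

n, trials = 3, 50_000
avg_div_n = 0.0    # mean squared deviation with divisor n ("using N" in the figure)
avg_div_n1 = 0.0   # usual sample variance with divisor n-1
for _ in range(trials):
    s = random.choices(population, k=n)       # sampling with replacement
    avg_div_n += statistics.pvariance(s) / trials
    avg_div_n1 += statistics.variance(s) / trials
```

On average the divisor-`n` figure comes out noticeably below `true_var`, while the divisor-`n-1` figure lands close to it, which is exactly the downward drift visible in the plot at small sample sizes.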

Instead of going into maths I’ll try to put it in plain words. If you have the whole population at your disposal then its variance (*population variance*) is computed with the denominator `N`. Likewise, if you have only a sample and want to compute this *sample’s* variance, you use the denominator `N` (the sample size, in this case). In both cases, note, you don’t *estimate* anything: the mean that you measured is the true mean and the variance you computed from that mean is the true variance.

Now, you have only a sample and want to infer about the unknown mean and variance in the population. In other words, you want *estimates*. You take your sample mean for the estimate of the population mean (because your sample is representative), OK. To obtain an estimate of the population variance, you have to pretend that that mean is really the population mean and therefore it is *not dependent on your sample anymore* since you computed it. To “show” that you now take it as fixed, you reserve one (any) observation from your sample to “support” the mean’s value: whatever your sample might have happened to be, one reserved observation could always bring the mean to the value that you’ve got and which you believe is insensitive to sampling contingencies. One reserved observation is “−1” and so you have `N-1` in computing the variance estimate.

Imagine that you somehow know the true population mean, but want to estimate the variance from the sample. Then you will substitute that true mean into the formula for variance and apply the denominator `N`: no “−1” is needed here since you *know* the true mean; you didn’t estimate it from this same sample.

The population variance is the sum of the squared deviations of all of the values in the population divided by the number of values in the population. When we are estimating the variance of a population from a sample, though, we encounter the problem that the deviations of the sample values from the mean of the sample are, on average, a little less than the deviations of those sample values from the (unknown) true population mean. That results in a variance calculated from the sample being a little less than the true population variance. Using an n−1 divisor instead of n corrects for that underestimation.
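The “known true mean” case above, and the underestimation it avoids, can be checked numerically. Here is a sketch using Python’s standard-library `statistics` module; the population, sample size, and trial count are arbitrary illustrative choices:

```python
import random
import statistics

random.seed(7)
population = [random.gauss(0, 1) for _ in range(10_000)]
mu = statistics.fmean(population)                  # the true population mean
true_var = statistics.pvariance(population, mu=mu)

n, trials = 5, 20_000
avg_known_mu = 0.0   # deviations from the KNOWN true mean, divisor N (no "-1")
avg_estimated = 0.0  # deviations from the sample mean, divisor N-1
for _ in range(trials):
    s = random.choices(population, k=n)            # sampling with replacement
    avg_known_mu += sum((x - mu) ** 2 for x in s) / n / trials
    avg_estimated += statistics.variance(s) / trials
```

Both averages land close to `true_var`: with the mean known, dividing by `N` is already unbiased; with the mean estimated from the same sample, the `N-1` divisor is what restores unbiasedness.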

You could get a better feeling about this question when playing with Octave or MATLAB…

Example:

You will verify a significant difference between `var1` and `var2`, since your sample size is very small. Repeat it by considering a larger population size: you will verify that `var1` $\approx$ `var2`.
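Since the original Octave/MATLAB snippet did not survive here, below is my own reconstruction of the experiment in Python; the definitions of `var1` (divide by `N`) and `var2` (divide by `N-1`) are an assumption about what the lost code computed, and I vary the sample size, which is what drives the gap between the two:

```python
import random
import statistics

def compare(sample_size: int) -> tuple[float, float]:
    """Return (var1, var2) for one random sample: divisor N vs divisor N-1."""
    sample = [random.gauss(0, 1) for _ in range(sample_size)]
    var1 = statistics.pvariance(sample)   # divide by N
    var2 = statistics.variance(sample)    # divide by N-1
    return var1, var2

random.seed(1)
small_var1, small_var2 = compare(3)      # tiny sample: clear gap, var1/var2 = 2/3
large_var1, large_var2 = compare(1000)   # large sample: var1/var2 = 999/1000
```

Whatever the data, the ratio `var1/var2` is exactly `(N-1)/N`, so at `N = 3` the two disagree by a third, while at `N = 1000` they are practically identical.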