Statistics

Mean, Standard Deviation and Variance

Mean

The mean is the average of the data: the sum of all values divided by how many values there are.

Formula for Mean:

$\text{Mean} = \frac{\sum X}{n}$

Where:

$\sum X$ is the sum of all data points
$n$ is the number of data points in the set.

Variance

Variance measures how far each data point in the set is from the mean. It gives an idea of how spread out the data is. A small variance indicates that the data points are close to the mean, while a large variance suggests they are spread out over a wider range.

Formula for Variance:

$\text{Variance} (\sigma^2) = \frac{\sum (X_i - \mu)^2}{n}$

Where:

$\sigma^2$ is the variance
$X_i$ represents each data point in the data set.
$\mu$ is the mean of the data set.
$n$ is the number of data points in the data set.

Standard Deviation

The standard deviation is simply the square root of the variance. It provides a more interpretable measure of spread because it is in the same units as the original data (whereas variance is in squared units).

Formula for Standard Deviation:

$\text{Standard Deviation} (\sigma) = \sqrt{\text{Variance}}$

Example

Consider the following data set representing the ages of 5 students: 18, 21, 19, 22, 20.

$\text{Mean} = \frac{18 + 21 + 19 + 22 + 20}{5} = \frac{100}{5} = 20$

To find the variance:

Subtract the mean from each data point
Square each result
Find the average of these squared differences

$\text{Variance} = \frac{(18-20)^2 + (21-20)^2 + (19-20)^2 + (22-20)^2 + (20-20)^2}{5} = \frac{4 + 1 + 1 + 4 + 0}{5} = \frac{10}{5} = 2$

$\text{Standard Deviation} (\sigma) = \sqrt{\text{Variance}} = \sqrt{2} \approx 1.41$
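The worked example above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function names `mean`, `population_variance`, and `population_std` are chosen here for illustration and do not come from any particular library.

```python
import math

# Ages of the 5 students from the example above.
ages = [18, 21, 19, 22, 20]

def mean(data):
    """Sum of all data points divided by the number of data points."""
    return sum(data) / len(data)

def population_variance(data):
    """Average of the squared differences from the mean (divides by n)."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def population_std(data):
    """Square root of the population variance."""
    return math.sqrt(population_variance(data))

print(mean(ages))                      # 20.0
print(population_variance(ages))       # 2.0
print(round(population_std(ages), 2))  # 1.41
```

Note that the variance here divides by $n$, matching the formula used in this example.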

Why Are These Measures Important in Data Science?

In data science, understanding the mean, variance, and standard deviation is essential for interpreting data, building predictive models, and evaluating the reliability of predictions.

a) Data Exploration and Summarization:

The mean can give insights into the general trend of the data. If you’re analyzing a company’s sales data, for example, the mean helps you understand the average sales over a certain period.

b) Identifying Data Distribution:

Variance and standard deviation are essential for understanding the distribution of the data. For instance:

If you’re working with financial data, such as stock prices, understanding the standard deviation tells you how volatile the market is. A higher standard deviation means greater fluctuation in the stock price, which may indicate risk.
If you’re analyzing customer ratings for a product, a low standard deviation means most customers gave similar ratings, whereas a high standard deviation means there was a greater variety in ratings.

c) Machine Learning Models:

When building machine learning models, many algorithms assume that the data is normally distributed (i.e., it follows a bell curve). Measures like the mean and standard deviation are important in:

Feature scaling (normalizing or standardizing data),
Outlier detection: data points that lie far from the mean, commonly more than two or three standard deviations away, can be considered unusual. We can describe how extreme a data point is by saying “how many sigmas” away from the mean it is.
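Both uses can be sketched with z-scores, which express each value as a number of standard deviations from the mean. The data below is hypothetical, and the cutoff of two standard deviations is an assumption made for this illustration; the right threshold depends on the data and the application.

```python
import statistics

# Hypothetical feature values; 30 is deliberately far from the rest.
values = [10, 12, 11, 13, 12, 11, 30]

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation

# Standardizing: subtract the mean, then divide by the standard deviation.
# Each z-score says "how many sigmas" a value is from the mean.
z_scores = [(x - mu) / sigma for x in values]

# Assumed cutoff for this sketch: flag values more than 2 sigmas from the mean.
THRESHOLD = 2.0
outliers = [x for x, z in zip(values, z_scores) if abs(z) > THRESHOLD]

print(outliers)  # [30]
```

After standardizing, the z-scores have mean 0 and standard deviation 1, which is what puts features measured in different units on a comparable scale.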

d) Statistical Inference:

In hypothesis testing, knowing the variance and standard deviation helps in determining the confidence intervals and understanding the margin of error of estimates.

e) Comparing Data Sets:

When comparing two or more data sets, the mean tells you the central tendency, and the variance/standard deviation helps you understand which data set is more consistent or reliable.

f) Example

If the average age is 50 and the standard deviation is 10, then most students’ ages in the class lie between 40 and 60 (about 68% of them, if the ages are roughly normally distributed). On the bell curve, the standard deviation describes the width of the curve.
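To put a number on "most", here is a small sketch that assumes the ages follow a normal distribution with mean 50 and standard deviation 10. That distributional assumption is made for illustration; real ages need not be normal. It uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Assumption: ages are normally distributed with mean 50 and standard deviation 10.
ages = NormalDist(mu=50, sigma=10)

# Share of values within one and two standard deviations of the mean.
within_one_sigma = ages.cdf(60) - ages.cdf(40)
within_two_sigma = ages.cdf(70) - ages.cdf(30)

print(round(within_one_sigma, 4))  # 0.6827
print(round(within_two_sigma, 4))  # 0.9545
```

Under this assumption, roughly 68% of ages fall between 40 and 60, and roughly 95% fall between 30 and 70.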

Sample and Population

These terms are important because they define the scope of the data you are analyzing, and they have a direct impact on how you compute variance and other statistical measures.

1. Population

A population is the entire set of individuals, items, or data points that you’re interested in studying. It includes every possible data point that meets your criteria.

Example: If you’re studying the heights of all adult women in the United States, the population would consist of every adult woman in the U.S.
Notation: The size of the population is typically denoted by $N$ (the total number of data points in the population).

2. Sample

A sample, on the other hand, is a subset of the population. Samples are used because it’s often impractical or impossible to collect data from every member of the population.

Example: If you’re studying the heights of adult women in the U.S., it would be much easier to measure the heights of a smaller group (say 100 women) than to measure the heights of every adult woman in the country.
Notation: The size of a sample is usually denoted by $n$ (the number of data points in the sample).

3. How Population and Sample Relate to Variance

Variance is a measure of how spread out the data points are from the mean. Whether you’re working with a population or a sample affects how you calculate variance.

Variance of a Population

When you calculate the variance for a population, you are using data from every member of the population, so you’re able to compute a precise measure of the spread of the data. The formula for population variance is:

$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$

Variance of a Sample

When you’re working with a sample, you only have data from a subset of the population, so you have to estimate the variance. In sample variance we divide by $n-1$, not by $n$. This adjustment is known as Bessel’s correction. It corrects for the fact that dividing by $n$ tends to underestimate the true variance of the population, and it makes the sample variance an unbiased estimator of the population variance.

The formula for sample variance is slightly different:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
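As a quick illustration, the sketch below treats the five ages from the earlier example as a sample rather than a full population (an assumption made only for this comparison). The helper `sample_variance` is written out by hand here; the standard library's `statistics.variance` and `statistics.pvariance` implement the sample and population versions respectively.

```python
import statistics

# The five ages from the earlier example, now treated as a sample.
ages = [18, 21, 19, 22, 20]

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

print(sample_variance(ages))  # 2.5

# Cross-check against the standard library:
# statistics.variance divides by n - 1, statistics.pvariance divides by n.
sample_var = statistics.variance(ages)
population_var = statistics.pvariance(ages)
```

The sum of squared deviations is 10 in both cases; dividing by 4 instead of 5 gives 2.5 rather than 2.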

Why is Bessel’s Correction Needed for Samples?

The sample mean $\bar{X}$ is generally closer to the sample’s data points than the population mean $\mu$ would be. This is because the sample mean is computed from those same data points; in fact, it is the value that minimizes the sum of squared deviations from them.

As a result, if you were to divide by $n$ (as you would for the population), you would underestimate the true variance of the population. By using $n-1$, we make a correction that compensates for this bias and gives a more accurate estimate of the population variance.
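The bias can be checked exactly on a tiny case. The sketch below treats the five ages from the earlier example as a complete population (variance 2) and enumerates every possible sample of size 2 drawn with replacement. Both the choice of population and the sample size are assumptions made for illustration. Averaging each estimator over all samples shows what it yields "on average".

```python
from itertools import product

# Treated here as the whole population; its variance is 2.
population = [18, 21, 19, 22, 20]
n = 2  # sample size chosen for this illustration

def sum_sq_dev(sample):
    """Sum of squared deviations of the sample from its own mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

# Every ordered sample of size n drawn with replacement (5 ** 2 = 25 samples).
samples = list(product(population, repeat=n))

# Average value of each estimator across all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(avg_div_n)          # 1.0 (underestimates the population variance of 2)
print(avg_div_n_minus_1)  # 2.0 (matches the population variance)
```

Dividing by $n$ recovers only $\frac{n-1}{n}$ of the true variance on average (here half of it), while dividing by $n-1$ recovers it exactly.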