Introduction to descriptive statistics

How do we ‘understand’ a dataset? Ideally, we would look at every value in it. This is not feasible when the dataset is large: looking at each ‘datapoint’ is too much information, and meaningful inferences are hard to draw. Descriptive statistics summarize the data with a small set of useful measures.

Important measures of descriptive statistics

Measures of central tendency describe the data in terms of its average value, its middle value and its most frequently occurring value.

Measures of Central tendency

These measures answer the question of what a typical, or central, value of the given data is.

The three commonly used measures are:

• Mean
• Median
• Mode

Mean

It is the average of all values in a dataset. It is calculated as shown below:

Mean = (x1 + x2 + x3 + … + xn) / n

Where x1, x2, x3, …, xn are the values in the dataset and n is the number of values in the dataset.

Note: While the mean/average is a very good measure of central tendency, it is sensitive to outliers: a single extreme value can pull it far from the bulk of the data. This should be kept in mind when calculating and interpreting the ‘mean’.
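
As a minimal sketch of the formula and the note above, the following Python snippet computes the mean with and without a single outlier. The function name `mean` and both lists are made-up illustrations, not data from any real source.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Hypothetical data: five similar values, then the same values plus one outlier.
typical = [10, 12, 11, 13, 14]
with_outlier = typical + [100]

mean_typical = mean(typical)            # 60 / 5
mean_with_outlier = mean(with_outlier)  # 160 / 6

print(mean_typical)
print(mean_with_outlier)
```

With these illustrative numbers, one extra value moves the mean from 12 to roughly 26.7, even though five of the six values lie between 10 and 14.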

Median

It is the middle value in the dataset when the data is sorted.

Median = value at position (n+1)/2, when n is ODD and the data is sorted

Median = average of the values at positions n/2 and (n/2)+1, when n is EVEN and the data is sorted

Suppose we have 9 values in a dataset; the median is then the 5th value when the dataset is sorted.

With an even number of values, the median is the average of the two middle values. For example, if we have 10 values in a dataset, the median is the average of the 5th and 6th values when sorted.

Note: When the data is skewed, or is known to contain a few exceptions/outliers, it is better to consider the ‘median’ rather than the mean, because the median depends only on the middle of the sorted data and not on how extreme the outlying values are.
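
The two positional rules can be sketched in Python as follows. This is an illustrative implementation under the definitions above; the function name `median` and the sample lists are assumptions made for the example. Note that the formulas use 1-based positions, while Python lists are 0-based.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and (n/2)+1 are 0-based indices mid-1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data: the integers 1..9 in shuffled order, then with 10 appended.
nine_values = [7, 1, 9, 3, 5, 2, 8, 4, 6]
ten_values = nine_values + [10]

median_of_nine = median(nine_values)  # 5th value of the sorted data
median_of_ten = median(ten_values)    # average of the 5th and 6th values

# A skewed list with one outlier: the median stays near the bulk of the data.
median_skewed = median([10, 12, 11, 13, 14, 100])

print(median_of_nine, median_of_ten, median_skewed)
```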

Mode

Mode is the value that occurs the maximum number of times in a dataset. A dataset can have one mode or several, depending on the frequency of its values.
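
A small Python sketch of this definition, returning every value that shares the highest frequency so that datasets with several modes are handled too. The function name `modes` and the example lists are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values that occur the maximum number of times, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


single_mode = modes([1, 2, 2, 3])    # only 2 occurs twice
two_modes = modes([1, 1, 2, 2, 3])   # 1 and 2 both occur twice

print(single_mode, two_modes)
```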

Apart from the mean and median, which specify the average/middle value, we need to know how far the values deviate from this average in general. This leads to the concept of dispersion.

Measures of dispersion

Absolute deviation

• Calculates the (absolute) deviation of each value in the dataset from the ‘mean’
• A deviation can be either above or below the mean, but we are interested in its size regardless of direction. Hence, we take the absolute deviation
• The sum of all such absolute deviations is divided by the number of values to get the ‘average absolute deviation’.

Average absolute deviation = (|x1 – mean| + |x2 – mean| + |x3 – mean| + … + |xn – mean|) / n

Where x1, x2, x3, …, xn are the values in the dataset and n is the number of values in the dataset.
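
The steps above can be sketched in Python as follows. The function name `average_absolute_deviation` and the four-value dataset are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def average_absolute_deviation(values):
    """Mean of the absolute deviations of each value from the mean."""
    if not values:
        raise ValueError("at least one value is required")
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


# Hypothetical data with mean 5: the absolute deviations are 3, 1, 1 and 3.
data = [2, 4, 6, 8]
aad = average_absolute_deviation(data)  # (3 + 1 + 1 + 3) / 4

print(aad)
```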

Variance

• In the calculation of absolute deviation, we used ‘absolute’ values to remove the effect of the direction of deviation from the mean.
• Another way to remove the effect of direction is to square the deviations.
• Variance is the sum of the squared deviations of each value from the mean, divided by the number of values.

Variance = [(x1 – mean)^2 + (x2 – mean)^2 + (x3 – mean)^2 + … + (xn – mean)^2] / n

Note

• As we can observe, squaring the deviations gives more weight in the calculation to values far from the mean than to values near it.
• Squaring the deviations makes the units of variance the square of the units of ‘x’.
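
To make the effect of squaring concrete, here is a minimal Python sketch of the variance formula above (dividing by n). The function name `population_variance` and the dataset are illustrative assumptions.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by the number of values."""
    if not values:
        raise ValueError("at least one value is required")
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Hypothetical data with mean 5.
data = [2, 4, 6, 8]
mean = sum(data) / len(data)

# Deviations of size 3 become 9 after squaring, deviations of size 1 stay 1.
squared_deviations = [(x - mean) ** 2 for x in data]
variance = population_variance(data)  # (9 + 1 + 1 + 9) / 4

print(squared_deviations, variance)
```

In this example the two values farthest from the mean contribute 18 of the 20 units in the sum of squares, whereas they contribute 6 of the 8 units in the sum of absolute deviations.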

To bring the measure of dispersion back to the same units as ‘x’, we use the standard deviation, which is simply the square root of the variance.

Standard deviation

Standard deviation = SQRT (Variance)

Population standard deviation = SQRT {[(x1 – mean)^2 + (x2 – mean)^2 + (x3 – mean)^2 + … + (xn – mean)^2] / n}

Sample standard deviation = SQRT {[(x1 – mean)^2 + (x2 – mean)^2 + (x3 – mean)^2 + … + (xn – mean)^2] / (n-1)}

When to use population standard deviation and when to use sample standard deviation?

The formula for population standard deviation is used when

• Our dataset contains all the values of the population of interest

OR

• We are interested only in the values present in our dataset, and not in the entire population or in values outside our dataset.

Sample standard deviation is an estimate of the population standard deviation computed from a sample dataset.

This is used when

• Our dataset is a sample from a population and we are interested in estimating the standard deviation of the population using the values in our sample dataset.
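
The two formulas differ only in the divisor, which the following Python sketch makes explicit. The dataset and variable names are made up for illustration; the results are cross-checked against `pstdev` and `stdev` from Python's standard `statistics` module, which implement the population and sample versions respectively.

```python
import math
import statistics

# Hypothetical data with mean 5 and sum of squared deviations 20.
data = [2, 4, 6, 8]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Population formula: divide by n.
population_sd = math.sqrt(sum_of_squares / n)

# Sample formula: divide by n - 1 (needs at least two values).
sample_sd = math.sqrt(sum_of_squares / (n - 1))

# Cross-check against the standard library implementations.
matches_pstdev = math.isclose(population_sd, statistics.pstdev(data))
matches_stdev = math.isclose(sample_sd, statistics.stdev(data))

print(population_sd, sample_sd, matches_pstdev, matches_stdev)
```

Because the sample formula divides by a smaller number, the sample standard deviation is always at least as large as the population one for the same data, and the gap shrinks as n grows.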
