Confidence intervals are one of the most misunderstood topics in statistics. Nevertheless, we will try to break the topic into pieces and understand it thoroughly. In this section, we mainly cover the conceptual aspects of confidence intervals before diving into the mathematical calculations in our next post.
Why do we require a confidence interval?
The answer is ESTIMATION. Because we cannot cover the entire population to understand its characteristics, we usually draw a representative sample and estimate the characteristics of the population from those of the sample. Such estimation carries a certain amount of uncertainty, which leads to the concept of confidence intervals.
Let us take a classical example: suppose we want to calculate the average height of all children in a particular age group (8 to 10 years) in the state of Texas.
It would not be practically feasible to measure the height of every child, so we draw a random sample from the population and calculate sample statistics, from which we try to infer something about the entire population. Here the ‘population’ is all children in Texas in the age group of 8 to 10 years.
Population parameter and Sample statistic
Let us look at what the terms sample statistic and population parameter mean. A sample statistic is a measure of a characteristic calculated from sample data, while a population parameter is a measure of a characteristic calculated from data on the entire population. The population parameter is fixed for a particular characteristic, while the sample statistic may vary across samples and depends on a variety of factors.
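The difference between a fixed parameter and a varying statistic can be seen in a small simulation. This is only a sketch: the population below is simulated, and the mean of 52 inches, the standard deviation of 3 inches, the population size and the helper names make_population and sample_means are all assumptions made for illustration, not real data on children in Texas.

```python
import random
import statistics

def make_population(size=100_000, mean=52.0, sd=3.0, seed=0):
    """Simulated heights in inches. Every number here is an assumption."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(size)]

def sample_means(population, sample_size=300, n_samples=5, seed=1):
    """Average height of several independent random samples."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

population = make_population()

# Population parameter: one fixed number for the whole population.
population_mean = statistics.fmean(population)

# Sample statistics: one number per sample, different each time.
means = sample_means(population)

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(means, start=1):
    print(f"sample {i} mean:   {m:.2f}")
```

Each sample mean should land near the population mean without matching it exactly, which is the uncertainty the rest of this post is about.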
In our example, suppose a sample of 300 children is taken and the average height comes out to be ‘x’ inches. The average height of the population cannot be determined precisely from this number ‘x’ alone, since it comes from sample data. Hence, we specify a confidence level together with a confidence interval to give an idea of the average height of the entire population.
A confidence interval is a range of values associated with a particular confidence level.
For example, we can specify the confidence interval as (x-a) inches to (x+a) inches for the average height of the population at a 90% confidence level. Usually, experiments use 90%, 95% or even 99% confidence levels and the corresponding confidence intervals. However, one can use any confidence level and corresponding interval, depending on the requirement and fit.
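To make the (x-a) to (x+a) form concrete, here is a minimal sketch of one common way to obtain ‘a’, assuming a normal approximation for the sample mean. The mathematical details are left to the next post, so treat this as an illustration rather than the only method. The sample is simulated, and the function name mean_confidence_interval and the numbers used are assumptions.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.90):
    """Return (x - a, x + a) for the mean, using a normal approximation."""
    n = len(sample)
    x = statistics.fmean(sample)                      # sample mean
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is about 1.645 for a 90% level and about 2.576 for a 99% level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    a = z * standard_error                            # half-width of the interval
    return x - a, x + a

# A made-up sample of 300 heights in inches (assumed mean 52, sd 3).
rng = random.Random(42)
heights = [rng.gauss(52.0, 3.0) for _ in range(300)]

low90, high90 = mean_confidence_interval(heights, 0.90)
low99, high99 = mean_confidence_interval(heights, 0.99)

print(f"90% interval: {low90:.2f} to {high90:.2f} inches")
print(f"99% interval: {low99:.2f} to {high99:.2f} inches")
```

For the same sample, a higher confidence level gives a wider interval, because ‘a’ grows with the level while ‘x’ stays the same.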
Confidence interval – A range of values on either side of the sample statistic.
90% confidence level – If samples are drawn repeatedly from the population and a 90% confidence interval is calculated from each one, then about 90% of those intervals will contain the actual population parameter. We tend to avoid the word probability here as it may confuse the reader.
In our example, we take a sample of 300 children, calculate the 90% confidence interval for that sample, and see whether the actual average height of the population falls in this range. If this process of drawing a sample and calculating a 90% interval is repeated 100 times, we expect the actual population parameter to fall inside the calculated range in about 90 of those cases.
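The repeated-sampling idea can be checked with a short simulation. The sketch below assumes heights follow a normal distribution with a true mean of 52 inches and a standard deviation of 3 inches (both invented for illustration), draws each sample directly from that distribution, and uses a normal approximation for the interval. The function name coverage is hypothetical.

```python
import math
import random
import statistics

def coverage(true_mean=52.0, sd=3.0, sample_size=300,
             repeats=1000, confidence=0.90, seed=7):
    """Fraction of repeated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(repeats):
        # Draw a fresh sample and build its own interval (x - a, x + a).
        sample = [rng.gauss(true_mean, sd) for _ in range(sample_size)]
        x = statistics.fmean(sample)
        a = z * statistics.stdev(sample) / math.sqrt(sample_size)
        if x - a <= true_mean <= x + a:
            hits += 1
    return hits / repeats

rate = coverage()
print(f"share of 90% intervals containing the true mean: {rate:.3f}")
```

The share should come out close to 0.90 but rarely exactly 0.90, since the 90% figure describes the long-run behaviour of the procedure rather than any fixed batch of samples.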
Confidence interval and probability
Can we say that, for a 90% confidence interval calculated from a single sample, there is a 90% probability that the actual population parameter falls in that range? The answer is debatable, and many experts answer ‘No’. This is mainly because the actual population parameter is a fixed value, so it either falls in the calculated range or it does not: the answer is simply yes or no. Hence, on this view, describing the confidence level as a probability is not a good idea.
Although this can be argued and debated, in my personal opinion it is reasonable to assume that ‘confidence level implies the probability of the population parameter falling in the calculated range’. The reason is that in most practical cases we sample only once and try to estimate the population from that one sample. Hence, the range obtained from this sampling holds little importance if it implies nothing about the chance of the population parameter falling inside it.
We leave this article with a few questions to ponder, which we will try to answer mathematically in our next post.
What if one sample is significantly different from another? Will the confidence intervals estimated from both samples contain the population parameter with the same level of confidence?
What if the values in the population are highly scattered (that is, the population has a high standard deviation)?
What is the effect of sample size in this exercise? One could take 300 children while another may take 700 children in the sample!