As we saw in the previous section, ‘Introduction to simple linear regression’, the regression equation is a straight line of the form
y = b0 + b1x1
Any observed value of the variable y can be represented as
y = b0 + b1x1 + e
where e is the error, or residual.
Parameters in regression
To build a regression equation that fits the given data, we need to estimate the following regression parameters:
b0: the y-intercept (the average value of y when x = 0)
b1: the slope of the regression line (the average change in y for a one-unit increase in x, which describes the linear relationship between y and x)
Methods of regression
One of the most widely used approaches for regression is the least squares method. This method finds the regression line that minimises the sum of squares of all the ‘errors/residuals’.
Recall that the error is defined as the vertical distance between a data point and the regression line.
The least squares method itself has several variants, such as ‘ordinary least squares’ and ‘generalised least squares’. Here, we will discuss ordinary least squares regression.
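To make the least squares criterion concrete, here is a minimal Python sketch that computes the sum of squared errors for two candidate lines. The dataset and the function name `sum_squared_errors` are hypothetical and chosen only for illustration; the second pair of coefficients happens to be the least squares solution for this particular data.

```python
def sum_squared_errors(x, y, b0, b1):
    """Return the sum of squared residuals for the line y = b0 + b1 * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        predicted = b0 + b1 * xi   # point on the line at xi
        residual = yi - predicted  # signed vertical distance to the line
        total += residual ** 2
    return total


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# An arbitrary candidate line versus the least squares line for this data.
sse_candidate = sum_squared_errors(x, y, b0=2.0, b1=0.5)
sse_least_squares = sum_squared_errors(x, y, b0=2.2, b1=0.6)

print(sse_candidate, sse_least_squares)
```

The least squares line gives the smaller of the two sums, which is exactly the property the method is built around.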
Before moving further, let us understand the concept of sum of squares.
Sum of squares and cross product
The sum of squares for a variable x is defined as the sum of the squared deviations of each value of x from its mean.
Deviation of each value of x = (xi – x̅)
Squared deviation of each value of x = (xi – x̅)^2
Sum of squared deviations for all values of x = ∑ (xi – x̅)^2 for i = 1 to n
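The equations for calculating the sums of squares and the cross product are given below.
Sum of squares of x: SSxx = ∑ (xi – x̅)^2
Sum of squares of y: SSyy = ∑ (yi – y̅)^2
Cross product of x and y: SSxy = ∑ (xi – x̅)(yi – y̅)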
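The sum of squares defined above, and the cross product (the sum of the products of the paired deviations of x and y from their means), can be sketched in Python as follows. The helper names and the small dataset are assumptions made for illustration, not part of the original material.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_squares(values):
    """Sum of squared deviations of each value from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def cross_product(x, y):
    """Sum of products of the paired deviations of x and y from their means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

ss_xx = sum_of_squares(x)    # sum of squares of x
ss_yy = sum_of_squares(y)    # sum of squares of y
ss_xy = cross_product(x, y)  # cross product of x and y

print(ss_xx, ss_yy, ss_xy)
```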
Let us take an example dataset to understand the terms involved in building a regression model.
The example dataset is given below.
The regression equation for the above data is shown in the graph below.
The total deviation of y from its mean can be expressed as the sum of two deviations:
- Deviation of the actual y value from the y value predicted by the regression line (the error)
- Deviation of the y value predicted by the regression line from the mean of y
Graphically, these two deviations and the total deviation are shown in the image above.
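The decomposition can be checked numerically with a short Python sketch. The dataset is hypothetical, and the coefficients b0 = 2.2 and b1 = 0.6 are assumed here to be the least squares estimates for it; the variable names are illustrative.

```python
# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Assumed least squares coefficients for this particular dataset.
b0, b1 = 2.2, 0.6

y_bar = sum(y) / len(y)
predicted = [b0 + b1 * xi for xi in x]

# Deviation of actual y from predicted y (the error).
errors = [yi - pi for yi, pi in zip(y, predicted)]
# Deviation of predicted y from the mean of y.
explained = [pi - y_bar for pi in predicted]
# Total deviation of actual y from the mean of y.
total = [yi - y_bar for yi in y]

for t, e, r in zip(total, errors, explained):
    print(f"total={t:+.2f}  error={e:+.2f}  predicted-minus-mean={r:+.2f}")
```

For every data point, the total deviation equals the error plus the deviation of the predicted value from the mean, up to floating-point rounding.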
We would like to find the line that minimises the value of SSE (the sum of squares of errors). Using calculus, we can find the line that satisfies this condition.
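The calculus step can be sketched as follows. Writing SSE as a function of the two parameters, with x_i and y_i denoting the i-th observation (notation assumed here for the derivation), we set both partial derivatives to zero:

```latex
\begin{aligned}
SSE(b_0, b_1) &= \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
\frac{\partial\, SSE}{\partial b_0} &= -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0 \\
\frac{\partial\, SSE}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0
\end{aligned}
```

Solving these two equations simultaneously for b0 and b1 gives the intercept and slope of the least squares line.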
The slope of the least squares regression line is given by
b1 = ∑ (xi – x̅)(yi – y̅) / ∑ (xi – x̅)^2
Once we have calculated the slope of the regression line, we can calculate the y-intercept b0 using the fact that the regression line always passes through the point (x̅, y̅):
b0 = y̅ – b1x̅
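The slope and intercept calculations can be put together in a minimal Python sketch. The function name `fit_simple_ols` and the dataset are hypothetical; the slope is computed as the cross product of x and y divided by the sum of squares of x, and the intercept from the fact that the line passes through the point of means.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Cross product of x and y, and sum of squares of x.
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)

    b1 = ss_xy / ss_xx         # slope
    b0 = y_bar - b1 * x_bar    # line passes through (x_bar, y_bar)
    return b0, b1


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)
print(f"y = {b0:.2f} + {b1:.2f}x")
```

Note that this sketch assumes the x values are not all identical; otherwise the sum of squares of x is zero and the slope is undefined.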
Confidence intervals of the slope parameter
The regression model we build is based on observed data, and hence the calculated coefficients are sample statistics. As with any sample statistic, there is a standard error associated with the slope and intercept parameters in regression, and we can build confidence intervals for these parameters.
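The standard error associated with the estimate of the slope parameter is given by
SE(b1) = s / √(∑ (xi – x̅)^2)
where s = √(SSE / (n – 2)) is the standard error of the regression and n is the number of observations.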
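The confidence interval for the slope parameter is given by
b1 ± t(α/2, n – 2) × SE(b1)
where SE(b1) is the standard error of the slope estimate and t(α/2, n – 2) is the critical value of the t distribution with n – 2 degrees of freedom for a confidence level of 100(1 – α)%.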
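As a rough illustration, the standard error and confidence interval of the slope can be computed as in the Python sketch below. It assumes the usual textbook formulas: the standard error of the slope is s divided by the square root of the sum of squares of x, with s the square root of SSE / (n – 2). The function name, the dataset, and the choice to pass the t critical value in as an argument (looked up from a t-table rather than computed) are assumptions made for this example.

```python
import math


def slope_confidence_interval(x, y, t_critical):
    """Return (b1, se_b1, (lower, upper)) for the least squares slope.

    t_critical is the two-sided critical value of the t distribution
    with n - 2 degrees of freedom, supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar

    # Sum of squared errors around the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    s = math.sqrt(sse / (n - 2))    # standard error of the regression
    se_b1 = s / math.sqrt(ss_xx)    # standard error of the slope

    margin = t_critical * se_b1
    return b1, se_b1, (b1 - margin, b1 + margin)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# 3.182 is the tabulated two-sided 95% t value for n - 2 = 3 degrees of freedom.
b1, se_b1, (lower, upper) = slope_confidence_interval(x, y, t_critical=3.182)
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With so few observations the interval is wide and includes zero, which shows how strongly the sample size affects the precision of the slope estimate.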
Once the regression model is developed and the equation of the line is calculated, we need to check whether the model is robust enough to make inferences or predictions. For this, we use ‘goodness of fit’ tests for regression, which we will discuss in the next section.