Introduction

As we saw in the previous section, ‘Introduction to simple linear regression’, the regression equation is a straight line of the form shown below:

y = b0 + b1x1

Any observed value of the variable y can be represented as

y = b0 + b1x1 + e

where e is the error, or residual.

Regression equation

ŷ = b0 + b1x1

where ŷ is the predicted (fitted) value of y for a given value of x1.

Parameters in regression

To build a regression equation that fits the given data, we need to estimate the following regression parameters:

b0: y-intercept (the average value of y at x = 0)

b1: slope of the regression line (captures the linear relationship between y and x)

Methods of regression

One of the most widely used approaches for regression is the least squares method. This method finds the regression line that minimises the sum of squares of all the errors (residuals).

Recall that the error is the vertical distance between each data point and the regression line.
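
To make this concrete, below is a minimal Python sketch that computes the sum of squared errors (SSE) for a candidate line; the small dataset and the candidate coefficients are made up purely for illustration.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(b0, b1, x, y):
    # residual = observed y minus the y predicted by the line b0 + b1*x
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(0.0, 2.0, x, y))  # SSE for the candidate line y = 2x
print(sse(0.1, 2.0, x, y))  # SSE for a slightly different candidate

The least squares method searches over all possible (b0, b1) pairs for the one that makes this quantity as small as possible.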

The least squares method itself comes in several variants, such as ordinary least squares and generalised least squares. Here, we will discuss ordinary least squares regression.

Before moving further, let us understand the concept of sum of squares.

Sum of squares and cross product

The sum of squares for a variable x is defined as the sum of the squared deviations of each value of x from its mean.

Deviation of a value of x from the mean: (xi – x̅)

Squared deviation: (xi – x̅)²

Sum of squared deviations over all values of x: ∑ (xi – x̅)², for i = 1 to n

Below are the equations for calculating the sums of squares and the cross product.

SSxx = ∑ (xi – x̅)²
SSyy = ∑ (yi – y̅)²
SSxy = ∑ (xi – x̅)(yi – y̅)
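
As a quick sketch, these three quantities can be computed directly from their definitions in Python; the dataset below is made up for illustration.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # SSxx
ss_yy = sum((yi - y_bar) ** 2 for yi in y)                        # SSyy
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # SSxy

print(ss_xx, ss_yy, ss_xy)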

Let us take an example dataset to understand the terms involved in building a regression model.

The example dataset is given below.

[Figure: sample dataset of paired (x, y) observations]

The regression equation for the above data is presented in the graph below.

[Figure: regression line fitted to the sample data]

[Figure: deviations of the data points from the regression line and from the mean of y]

The total deviation of a y value from the mean of y can be expressed as the sum of two deviations:

  • Deviation of the actual y value from the y value predicted by the regression line (the error)
  • Deviation of the predicted y value from the mean of y

Graphically, these deviations and the total deviation are shown in the image above.

(yi – y̅) = (yi – ŷi) + (ŷi – y̅), where ŷi is the value of y predicted by the regression line at xi
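
Squaring and summing each side of this decomposition over all data points gives the familiar identity SST = SSE + SSR for a least squares fit. The Python sketch below verifies this numerically on a small made-up dataset.

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]                      # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares

print(sst, sse + ssr)  # the two values agree up to floating point error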

We would like to find the line that minimises the value of SSE (the sum of squared errors). Using calculus, we can find the line that satisfies this condition.
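
As a sketch of that calculus step: writing SSE = ∑ (yi – b0 – b1xi)² and setting the partial derivatives to zero, ∂SSE/∂b0 = –2∑ (yi – b0 – b1xi) = 0 and ∂SSE/∂b1 = –2∑ xi(yi – b0 – b1xi) = 0, gives a pair of simultaneous equations (the normal equations) whose solution is the slope and intercept presented below.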

The slope of the least squares regression line is given by

b1 = SSxy / SSxx = ∑ (xi – x̅)(yi – y̅) / ∑ (xi – x̅)²

Once we have calculated the slope of the regression line, we can calculate the intercept b0 using the fact that the regression line always passes through the point (x̅, y̅):

b0 = y̅ – b1x̅
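
Putting the two formulas together, here is a minimal Python sketch that estimates the slope and intercept and cross-checks the result against numpy.polyfit; the dataset is made up for illustration.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()   # the line passes through (x-bar, y-bar)

print(b1, b0)
print(np.polyfit(x, y, deg=1))  # coefficients ordered [slope, intercept]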

Confidence intervals for the slope parameter

The regression model we build is based on observed data, and hence the calculated coefficients are sample statistics. As with any sample statistic, there is a standard error associated with the slope and intercept estimates, and we can build confidence intervals for these regression parameters.

The standard error associated with the estimate of the slope parameter is given by

SE(b1) = s / √SSxx, where s = √(SSE / (n – 2)) is the standard error of the estimate

The confidence interval for the slope parameter is given by

b1 ± t(α/2, n – 2) × SE(b1)
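
As a sketch of these two formulas in Python (the dataset is made up, and scipy is assumed to be available for the t critical value):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

s = np.sqrt((residuals ** 2).sum() / (n - 2))       # standard error of estimate
se_b1 = s / np.sqrt(((x - x.mean()) ** 2).sum())    # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)               # two-sided 95% critical value

print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)     # 95% CI for the slope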

Once the regression model is developed and the equation of the line is calculated, we need to check whether the model is robust enough for making inferences or predictions. For this, we will use goodness-of-fit tests for regression, which we will discuss in the next section.