Introduction to goodness of fit tests in linear regression

We discussed the concept of linear regression and how to find the regression equation in earlier sections.

Once the regression model is developed, it is important to check whether the model is robust and whether its key assumptions are met before applying it to draw insights and make predictions.

As we have seen earlier, one of the important assumptions of linear regression is the existence of a linear relationship between the dependent variable (y) and the independent variable (x).

A quick check for this is to look at the scatterplot of the data and see whether a relationship (positive or negative) is visible. However, a more robust way to check for a linear relationship is to conduct a hypothesis test.
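For illustration, here is a minimal Python sketch of this visual check. The data are synthetic, generated with an assumed slope and noise purely for demonstration; they are not from the article.

```python
# Minimal sketch: visual check for a linear relationship (illustrative data only).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)               # hypothetical independent variable
y = 2.5 * x + 4 + rng.normal(0, 2, 50)   # hypothetical dependent variable with noise

plt.scatter(x, y)
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.title("Scatterplot check for a linear relationship")
plt.show()
```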

Hypothesis test for checking the linear relationship

Null Hypothesis: No linear relationship

Alternative hypothesis: Linear relationship exists

As the slope of the regression equation is an indicator of the linear relationship, we use it for hypothesis testing. A positive slope indicates positive correlation and a negative slope indicates negative correlation between y and x, while a slope of zero indicates no correlation and hence no linear relationship between y and x.

Null Hypothesis: Slope = zero

Alternative hypothesis: Slope NOT EQUAL to zero

A t-test with $n - 2$ degrees of freedom is used for this hypothesis test, and the t-statistic is calculated as

$$t = \frac{b_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s_e}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad s_e = \sqrt{\frac{SSE}{n-2}}$$

where $b_1$ is the estimated slope, $s_{b_1}$ is the standard error of the slope, and $s_e$ is the standard error of the estimate.

The p-value corresponding to the above t-statistic is computed and compared to the chosen significance level (alpha) for the desired confidence level: if the p-value is less than alpha, the null hypothesis is rejected; otherwise we fail to reject it.
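As a sketch of this test, scipy.stats.linregress reports the slope, its standard error, and the two-sided p-value for the null hypothesis of zero slope. The data below are the same synthetic example as above.

```python
# Minimal sketch: t-test for the slope (H0: slope = 0) on illustrative data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 2, 50)

res = stats.linregress(x, y)              # fits y = intercept + slope * x
t_stat = res.slope / res.stderr           # t = b1 / s_b1, with n - 2 degrees of freedom
print(f"slope = {res.slope:.3f}, t = {t_stat:.2f}, p-value = {res.pvalue:.4f}")

alpha = 0.05
if res.pvalue < alpha:
    print("Reject H0: a linear relationship exists")
else:
    print("Fail to reject H0: no evidence of a linear relationship")
```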

RMSE (Root Mean Square Error)

The objective of the least squares regression method is to find a regression line that minimises the sum of squares of all the errors or residuals (minimise SSE).

The mean square error (MSE) and root mean square error (RMSE) of a regression model are calculated as shown below.

$$MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}, \qquad RMSE = \sqrt{MSE}$$

The lower the RMSE, the better the regression model. However, RMSE is not a relative measure: it is expressed in the units of y, so regression models built on differently scaled data cannot be compared based on their RMSE values. Imagine one case where the average y value is in millions and another where the average y value is in hundreds. Can we compare the RMSE for both these cases?

The answer is no. We need a relative measure that tells us how good or robust the regression model is. That relative measure is the coefficient of determination.
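A minimal sketch of the RMSE calculation, continuing the same synthetic example; SSE is divided by n − 2 here to match the regression standard error defined above.

```python
# Minimal sketch: MSE and RMSE of a fitted simple regression (illustrative data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 2, 50)

res = stats.linregress(x, y)
y_hat = res.intercept + res.slope * x      # fitted values
sse = np.sum((y - y_hat) ** 2)             # sum of squared errors
mse = sse / (len(x) - 2)                   # n - 2 degrees of freedom in simple regression
rmse = np.sqrt(mse)
print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```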

r² (Coefficient of determination)

r² is a relative measure and it is given by the equation below.

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

r², the coefficient of determination, measures how well the regression model fits the data. It is the percentage of total deviation explained by the regression model.

A higher r² value signifies a better regression model.

$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST\ (\text{total})} = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR\ (\text{explained by regression})} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE\ (\text{error})}$$

As shown above, the total deviation in the data can be split into two parts: one that is explained by the regression and another that is random (error). The percentage of total deviation explained by the regression model is defined as the coefficient of determination.
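The sketch below computes r² from this decomposition on the same synthetic data and cross-checks it against the squared correlation reported by scipy.

```python
# Minimal sketch: coefficient of determination from the SST = SSR + SSE split.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 2, 50)

res = stats.linregress(x, y)
y_hat = res.intercept + res.slope * x
sst = np.sum((y - y.mean()) ** 2)      # total deviation
ssr = np.sum((y_hat - y.mean()) ** 2)  # deviation explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained (error) deviation
r_squared = ssr / sst                  # equivalently 1 - sse / sst
print(f"r^2 = {r_squared:.3f}  (check: {res.rvalue ** 2:.3f})")
```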

In simple linear regression we consider only one explanatory/independent variable to explain the dependent/outcome variable. In practice there could be other factors affecting the outcome variable. Including such variables is important for modelling, and this leads to multiple regression.

One way to identify variables that may be affecting the model is to observe the plot of residuals. The assumptions of normality and equal variance of the residuals can also be checked from the residual plot.

Plot of residuals

Residuals are plotted against the independent variable to see whether they follow any pattern.

Ideally, residuals should not follow any particular pattern (neither increasing nor decreasing as x increases). A scatterplot with residuals on the y-axis and values of x (the independent variable) on the x-axis should be drawn, and the resulting plot should show no correlation.

The assumption of equal variance of residuals (homoscedasticity) can be judged from whether the vertical spread of the scatterplot stays roughly constant as x increases. For example, in the plot below the variance of the residuals is not equal across the values of x, so the assumption of equal variance of errors is not satisfied.

[Residual plot: the spread of residuals changes with x, indicating unequal variance]

Below is another plot of residuals that does not form a pattern, which signifies a good regression model. In a good regression model the errors are purely random (they cannot be explained and factored into our model), and hence they should not form any pattern with the independent variable x.

[Residual plot: residuals scattered randomly around zero with no pattern]
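A minimal sketch of such a residual plot, again on the synthetic example; because the data here were generated with constant-variance noise, the residuals should look patternless with roughly constant spread.

```python
# Minimal sketch: residuals plotted against x to check for patterns and equal variance.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 2, 50)

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x (independent variable)")
plt.ylabel("Residuals")
plt.title("Residual plot: no pattern and roughly constant spread expected")
plt.show()
```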

Similarly, other variables that might be affecting the outcome can be plotted on the x-axis with the residuals on the y-axis. If this plot shows a correlation, then such variables should be included in the regression model (multiple regression).

Interpreting the output of regression

Usually the simple regression output from any software consists of the coefficients of the intercept and the slope. The standard errors of these estimates are presented along with the corresponding t-values and p-values, as shown below.

[Sample regression output: intercept and slope coefficients with their standard errors, t-values and p-values]
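As an illustration, statsmodels produces such a table via its OLS summary; the sketch below uses the same synthetic data as the earlier examples.

```python
# Minimal sketch: regression output (coefficients, standard errors, t and p values).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(0, 2, 50)

X = sm.add_constant(x)             # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())             # coef, std err, t, P>|t| for intercept and slope
```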