Introduction to goodness of fit tests in linear regression
Once a regression model has been developed, it is important to check that the model is robust and that its key assumptions are met before using it to draw insights and make predictions.
As we have seen earlier, one of the important assumptions of linear regression is that a linear relationship exists between the dependent variable (y) and the independent variable (x).
A quick check is to look at a scatterplot of the data and see whether y tends to rise or fall as x increases (a positive or negative relationship). A more rigorous way to check for a linear relationship is to conduct a hypothesis test.
Hypothesis test for checking the linear relationship
Null Hypothesis: No linear relationship
Alternative hypothesis: Linear relationship exists
Because the slope of the regression equation indicates whether a linear relationship is present, we use it for the hypothesis test. A positive slope indicates a positive correlation between y and x, a negative slope indicates a negative correlation, and a slope of zero indicates no correlation and hence no linear relationship between y and x.
Null Hypothesis: Slope = zero
Alternative hypothesis: Slope NOT EQUAL to zero
A t test is used for this hypothesis test, with n - 2 degrees of freedom. The t statistic is calculated as t = b1 / SE(b1), where b1 is the estimated slope and SE(b1) is the standard error of that estimate.
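The p-value corresponding to this t statistic is computed and compared with the chosen significance level (alpha, for example 0.05 for a 95% confidence level). If the p-value is less than alpha, we reject the null hypothesis; otherwise we fail to reject it (we never "accept" the null hypothesis).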
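The slope test can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the data are made up, and instead of computing a p-value it compares |t| with the two-sided critical value for alpha = 0.05 and 3 degrees of freedom (3.182, read from a standard t table), which leads to the same decision.

```python
import math

# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares estimates of the slope and intercept.
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Sum of squared residuals and the standard error of the slope.
# The residual variance uses n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
df = n - 2
se_slope = math.sqrt((sse / df) / s_xx)

# t statistic for the null hypothesis "slope = 0".
t_stat = slope / se_slope

# Assumption: two-sided critical value for alpha = 0.05 and df = 3,
# taken from a standard t table. It is only valid for this toy n = 5.
T_CRITICAL = 3.182
reject_null = abs(t_stat) > T_CRITICAL

print(f"slope = {slope:.3f}, SE = {se_slope:.4f}, t = {t_stat:.2f}")
print("reject H0" if reject_null else "fail to reject H0")
```

With real data, a statistics library would normally supply the exact p-value for the t statistic, so no table lookup is needed.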
RMSE (Root Mean Square Error)
The objective of the least squares regression method is to find the regression line that minimises the sum of squares of all the errors, or residuals (that is, it minimises SSE).
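The mean square error (MSE) and root mean square error (RMSE) of the regression model are calculated as MSE = SSE / (n - 2) and RMSE = sqrt(MSE). Here n - 2 is the residual degrees of freedom in simple linear regression; some texts divide by n instead.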
The lower the RMSE, the better the regression model fits. However, RMSE is NOT a relative measure: it is expressed in the units of y, so regression models fitted to data on different scales cannot be compared on their RMSE values. Imagine a case where the average y value is in millions and another case where the average y value is in hundreds. Can we compare the RMSE for both these cases?
The answer is no. We need a relative measure that tells us how good the regression model is regardless of the scale of y. That relative measure is the coefficient of determination.
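The scale dependence of RMSE is easy to see numerically. The sketch below uses made-up data and hypothetical helper names (fit_line, rmse), and it assumes the n - 2 denominator for MSE. Multiplying every y value by 1000 leaves the quality of the fit unchanged but multiplies the RMSE by 1000.

```python
import math

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope

def rmse(x, y):
    """RMSE of the fitted line, using SSE / (n - 2) as the MSE (an assumption)."""
    intercept, slope = fit_line(x, y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_small = [2.1, 3.9, 6.2, 7.8, 10.1]
y_large = [1000 * yi for yi in y_small]  # same pattern, different units

rmse_small = rmse(x, y_small)
rmse_large = rmse(x, y_large)

print(f"RMSE on the original scale: {rmse_small:.4f}")
print(f"RMSE after scaling y by 1000: {rmse_large:.1f}")
```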
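r² (Coefficient of determination)
r² is a relative measure and is given by r² = SSR / SST = 1 - SSE / SST, where SST is the total sum of squares of y about its mean, SSR is the sum of squares explained by the regression, and SSE is the sum of squared residuals.
r², the coefficient of determination, measures how well the regression model fits the data. It is the proportion (often quoted as a percentage) of the total deviation in y that is explained by the regression model.
A higher r² value (closer to 1) indicates a better fit.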
As shown above, the total deviation in the data can be split into two parts: one that is explained by the regression and another that is random (error). The percentage of total deviation explained by the regression model is defined as the coefficient of determination.
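The split of the total deviation can be checked numerically. In the sketch below (made-up data, hypothetical function name sums_of_squares), the total sum of squares equals the explained part plus the error part, and the coefficient of determination stays the same when y is rescaled, which is what makes it a relative measure.

```python
def sums_of_squares(x, y):
    """Fit a least squares line and return (total, explained, error) sums of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    fitted = [intercept + slope * xi for xi in x]

    sst = sum((yi - y_mean) ** 2 for yi in y)               # total deviation
    ssr = sum((fi - y_mean) ** 2 for fi in fitted)          # explained by regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # random part (error)
    return sst, ssr, sse

# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst

# Rescaling y changes every sum of squares but not their ratio.
sst_big, ssr_big, _ = sums_of_squares(x, [1000 * yi for yi in y])
r_squared_big = ssr_big / sst_big

print(f"total = {sst:.3f}, explained = {ssr:.3f}, error = {sse:.3f}")
print(f"r squared = {r_squared:.4f}, after rescaling y = {r_squared_big:.4f}")
```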
In simple linear regression we consider only one explanatory (independent) variable to explain the dependent (outcome) variable. In practice, other factors may also affect the outcome variable. Including such variables is important for modelling, and doing so leads to multiple regression.
One way to identify variables that may be affecting the model is to examine the plot of residuals. The same plot can also be used to check the assumptions of normality and equal variance of the residuals.
Plot of residuals
Residuals are plotted against the independent variable to see whether they follow any pattern.
Ideally, residuals should not follow any particular pattern (they should neither increase nor decrease as x increases). Draw a scatterplot with the residuals on the y axis and the values of x (the independent variable) on the x axis; the resulting plot should not show any correlation.
Whether the assumption of equal variance of residuals (homoscedasticity) holds can be judged from how the vertical spread of the scatterplot changes as x increases. For example, in the plot below the variance of the residuals is not equal across the values of x, so the assumption of equal variance of errors is not satisfied.
Below is another plot of residuals that does not form a pattern, which signifies a good regression model. In a good regression model the errors are purely random (they cannot be explained or factored into the model), so they should not form any pattern with the independent variable x.
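When no plot is at hand, the change in spread can be approximated with a crude numeric stand-in: compare the mean squared residual for the larger x values with that for the smaller x values. The sketch below is only an illustration and not a formal test. The function names and both data sets are made up, with deterministic "noise" so that one set fans out as x grows and the other keeps a constant spread.

```python
def residuals_of_fit(x, y):
    """Residuals of the least squares line fitted to (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half."""
    pairs = sorted(zip(x, residuals_of_fit(x, y)))  # order residuals by x
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    mean_sq = lambda values: sum(r * r for r in values) / len(values)
    return mean_sq(upper) / mean_sq(lower)

x = [float(i) for i in range(1, 11)]
signs = [(-1) ** i for i in range(1, 11)]  # alternating -1, +1, ...

# Made-up data: error size grows with x (unequal variance).
y_fan = [2 * xi + s * 0.1 * xi for xi, s in zip(x, signs)]
# Made-up data: error size is constant (equal variance).
y_flat = [2 * xi + s * 0.3 for xi, s in zip(x, signs)]

ratio_fan = spread_ratio(x, y_fan)
ratio_flat = spread_ratio(x, y_flat)

print(f"spread ratio, fanning residuals: {ratio_fan:.2f}")
print(f"spread ratio, constant spread:   {ratio_flat:.2f}")
```

A ratio far from 1 suggests that the spread of the residuals changes with x; a ratio near 1 is what equal variance would look like. The residual plot remains the primary check.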
Similarly, other variables that might be affecting the outcome can be plotted on the x axis with the residuals on the y axis. If such a plot shows correlation, the variable should be included in the regression model (multiple regression).
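The same idea can be checked numerically by correlating the residuals with a candidate variable. In the sketch below everything is made up for illustration: y is built from x and a second variable z, the model is fitted on x alone, and the residuals turn out to be strongly correlated with the omitted z.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up data: the outcome depends on both x and z.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [1 + 2 * xi + 0.5 * zi for xi, zi in zip(x, z)]

# Simple regression of y on x only (z is left out of the model).
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# A strong correlation suggests z belongs in the model.
corr_with_z = pearson(residuals, z)
print(f"correlation between residuals and z: {corr_with_z:.3f}")
```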
Interpreting the output of regression
Usually, the simple regression output from any software consists of the estimated coefficients for the intercept and the slope. The standard errors of these estimates are presented along with the corresponding t values and p-values, as shown below.
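To make the columns of such an output concrete, the sketch below rebuilds a small coefficient table by hand from made-up data. The layout and names are illustrative and do not reproduce any particular package's output. The p-value column is left out because it needs the t distribution, which a statistics library would normally provide.

```python
import math

# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual variance with n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Standard errors of the two estimates.
se_slope = math.sqrt(mse / s_xx)
se_intercept = math.sqrt(mse * (1 / n + x_mean ** 2 / s_xx))

# Each t value is the estimate divided by its standard error.
table = {
    "intercept": {"estimate": intercept, "std_error": se_intercept,
                  "t_value": intercept / se_intercept},
    "slope": {"estimate": slope, "std_error": se_slope,
              "t_value": slope / se_slope},
}

print(f"{'term':<10}{'estimate':>10}{'std error':>11}{'t value':>9}")
for term, row in table.items():
    print(f"{term:<10}{row['estimate']:>10.4f}"
          f"{row['std_error']:>11.4f}{row['t_value']:>9.2f}")
```

In this toy example the slope has a large t value while the intercept's t value is small, so only the slope would be judged significantly different from zero.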