Simply put, regression is used for modelling the relationship among variables. Historical or observed data points for the variables under study are fed into the regression model.
Regression involves fitting a linear equation between a dependent variable and one or more independent variables (the dependent variable is the outcome/response variable; an independent variable is a factor affecting the dependent variable).
Objectives of regression
- Analyse the relationship among variables
- Quantify the level of influence of the factors (independent variables) on the outcome variable
- Predict the outcome variable for a given level of a factor/independent variable
A company wants to analyse the relationship between revenues (sales) and advertising spend. Here, the outcome/response variable is ‘Sales’ and the factor (independent variable) affecting it is ‘Advertising spend’.
Once the assumptions are met and a reliable regression model is developed, it can be used to predict future sales for a given level of advertising spend. One should, however, be cautious: many factors apart from advertising spend influence the sales numbers.
Simple linear regression
Only one independent variable or factor is involved in the regression.
In general, the dependent variable is a function of the independent variables:
y = f(x1, x2, x3, x4, …, xn)
In simple linear regression we have only one factor, so the equation becomes
y = f(x1)
As we are fitting a straight line (finding a linear relationship), the equation becomes
y = b0 + b1x1
where
y is the dependent variable,
x1 is the independent variable,
b0 is the y-intercept, and
b1 is the slope of the line.
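The intercept b0 and slope b1 can be estimated from data by ordinary least squares. Below is a minimal sketch using the standard least-squares formulas; the x and y values are hypothetical, not from the text.

```python
import numpy as np

# Hypothetical data: advertising spend (x) and sales (y)
x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([120, 150, 190, 210, 260], dtype=float)

# Least-squares slope: b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the fitted line passes through the point (x̄, ȳ)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # fitted line is approximately y = 84 + 3.4*x for this data
```

The same estimates can be obtained with `np.polyfit(x, y, 1)`, which returns the slope and intercept of the degree-1 least-squares fit.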
Scatterplot and the best fit line
It is good to have a visual depiction of the available data so as to broadly understand trends and patterns.
In a regression model, we fit the straight line (equation) that best fits the data, as shown below.
The equation of the best-fit line in the above plot is
y = 89,290 + 2.6129x
Errors or residuals in regression
The values of y given in the dataset can be represented as
y = 89,290 + 2.6129x + e
where e is the error term or residual value.
- e is +ve if the value of y for a data point is above the regression line
- e is -ve if the value of y for a data point is below the regression line
Error = actual y value – y value calculated using the regression equation
Hence, the residual value or error is the vertical distance between a data point and the regression line.
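This definition of the residual can be sketched directly in code. The fitted line below is the one given in the text (y = 89,290 + 2.6129x); the observed data point is a hypothetical example, not from the original dataset.

```python
def predict(x):
    # Best-fit line from the text: y = 89,290 + 2.6129*x
    return 89290 + 2.6129 * x

# Hypothetical observation (not from the original dataset)
x_obs, y_obs = 100000, 360000

# Residual = actual y value minus the value the regression line predicts
residual = y_obs - predict(x_obs)
print(residual)  # positive, so this point lies above the regression line
```

A positive residual means the point sits above the line; a negative one means it sits below.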
We will see in the next section how to build the regression equation; the rest of this section focuses on the assumptions of the regression model.
Assumptions of linear regression
- Linear relationship between the dependent variable and the independent variables
- Normal distribution of residuals with zero mean and fixed variance
- Errors are independent of each other and do not form a pattern
- Values of x (the independent variable) are not random; that is, there is no error associated with measuring x
Below is a brief explanation of these assumptions; we will return to them in the next section when we discuss the fit of the regression line and how good the developed model is.
As we are fitting a linear equation to explain the relationship between y and x, it is assumed that these variables indeed have a true linear relationship.
Imagine a case (example below) where the relationship between y and x is clearly non-linear but we try to fit a linear equation!
Similarly, the example below shows a relationship between y and x that is not close to linear.
Tests for linear relationship
- The scatter plot, correlation coefficient, and slope of the regression equation provide an idea of the linear relationship between y and x
- A correlation coefficient close to zero indicates NO linear relationship
- The slope of the line should be clearly +ve or –ve (significantly different from zero) for a linear relationship
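The correlation-coefficient test above can be sketched with NumPy's `corrcoef`. Both datasets below are hypothetical: one follows a perfect linear trend, the other has no clear trend.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_linear = np.array([2, 4, 6, 8, 10], dtype=float)  # perfect linear trend
y_flat = np.array([5, 3, 6, 4, 5], dtype=float)     # no clear trend

# np.corrcoef returns the correlation matrix; [0, 1] is r between x and y
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_flat = np.corrcoef(x, y_flat)[0, 1]

print(r_linear)  # 1.0 — strong evidence of a linear relationship
print(r_flat)    # close to zero — no linear relationship
```

Values of r near +1 or –1 support fitting a straight line; values near zero suggest linear regression is not appropriate for that pair of variables.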
Distribution of errors
Errors or residuals should be spread on either side of the regression line with mean zero and a fixed variance. These residuals should also follow a normal distribution.
In cases where the distribution is non-normal, regression can still be performed (methods for such cases are discussed in coming sections).
- The values of y at each value of x must be normally distributed
- The variance of y at each value of x is the same (homogeneity of variances)
- Errors are independent of each other
Errors should not increase or decrease as x increases. This means the errors must be purely random and should not form a pattern; in particular, they should not be increasing or decreasing with time.
Errors must be purely random; if they are not random and form a pattern, that pattern should have been accommodated in our regression equation!
We can observe this distribution by plotting the residuals against the x values.