**Introduction**

Simply put, regression models the relationship among variables. Historical or observed data points for the variables under study are fed into the regression model.

Regression involves fitting a linear equation between the dependent and independent variables (the dependent variable is the outcome/response variable; an independent variable is a factor affecting the dependent variable).

**Objectives of regression**

- Analyse the relationship among variables
- Quantify the level of influence of the factors (independent variables) on the outcome variable
- Predict the outcome variable for a given level of a factor/independent variable

__Example__

A company wants to analyse the relationship between revenue (sales) and advertising spend. Here, the outcome/response variable is ‘Sales’ and the factor (independent variable) affecting it is ‘Advertising spend’.

Once the assumptions are met and a reliable regression model is developed, it can be used to predict future sales for a given level of advertising spend. One should, however, be cautious, as many factors apart from advertising spend influence the sales numbers.
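The idea can be sketched with a few lines of NumPy. The advertising/sales figures below are hypothetical, made up purely for illustration:

```python
import numpy as np

# Hypothetical data: advertising spend and the resulting sales
# (arbitrary currency units, invented for this example)
ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([120.0, 150.0, 210.0, 240.0, 290.0])

# Fit a straight line: sales ≈ b0 + b1 * ad_spend
# np.polyfit returns coefficients from highest degree down, so (b1, b0)
b1, b0 = np.polyfit(ad_spend, sales, deg=1)

# Predict sales for a planned advertising spend of 35
predicted = b0 + b1 * 35
```

With this toy data the fitted line is sales ≈ 73 + 4.3 × spend, so a spend of 35 predicts sales of about 223.5 — subject to the caveat above that real sales depend on many other factors.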

**Simple linear regression**

Only one independent variable or factor is involved in the regression.

**Regression equation**

The dependent variable is expressed as a function of the independent variables:

**y = f(x_1, x_2, x_3, x_4, …, x_n)**

In simple linear regression we have only one factor, so the equation becomes

**y = f(x_1)**

As we are fitting a straight line (finding a linear relationship), the equation becomes

**y = b_0 + b_1 x_1**

*Where*

y is the dependent variable,

x_1 is the independent variable,

b_0 is the y-intercept and

b_1 is the slope of the line.
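As a preview of the next section, b_0 and b_1 are typically estimated by least squares. A minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical (x, y) observations, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares estimates:
# slope b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# intercept b0 = ȳ - b1 * x̄ (the fitted line passes through the means)
b0 = y.mean() - b1 * x.mean()
```

The full derivation of these formulas is covered in the next section; this snippet only shows what the two coefficients are.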

**Scatterplot and the best fit line**

It is good to have a visual depiction of the available data so as to broadly understand its trends and patterns.

In a regression model, we fit the straight line (equation) that best fits the data, as shown below.

The equation of the best-fit line in the above plot is

**y = 89,290 + 2.6129 x**
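Once the coefficients are known, the line is straightforward to evaluate. A small sketch using the best-fit equation above:

```python
# Best-fit line from the plot: y = 89,290 + 2.6129 x
def predict_y(x):
    """Evaluate the fitted regression line at a given x."""
    return 89290 + 2.6129 * x

# Predicted y at x = 10,000: 89,290 + 26,129 ≈ 115,419
y_hat = predict_y(10000)
```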

**Errors or residuals in regression**

Values of y given in the dataset could be represented as

**y = 89,290 + 2.6129 x + e**

where e is the error term or residual.

- e is +ve if the value of y for a datapoint is above the regression line
- e is -ve if the value of y for a datapoint is below the regression line

**Error = Actual y value – y value calculated using the regression equation**

Hence, the residual value or error is the vertical distance between the datapoint and the regression line.
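The definition above is a one-line subtraction in code. Using the example line from earlier with a few hypothetical datapoints:

```python
import numpy as np

# Example line from above: y = 89,290 + 2.6129 x
# The (x, y_actual) pairs are hypothetical, invented for illustration
x = np.array([1000.0, 2000.0, 3000.0])
y_actual = np.array([92000.0, 94000.0, 98000.0])

# y values predicted by the regression equation
y_fitted = 89290 + 2.6129 * x

# Residual = actual y - fitted y: +ve above the line, -ve below it
residuals = y_actual - y_fitted
```

Here the second point lies below the line (negative residual) while the first and third lie above it.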

We will see in the next section how to build the regression equation; the rest of this section focuses on the assumptions of the regression model.

**Assumptions of linear regression**

- A linear relationship exists between the dependent variable and the independent variables
- Residuals are normally distributed with zero mean and fixed variance
- Errors are independent of each other and do not form a pattern
- Values of x (the independent variable) are not random; that is, there is no error associated with measuring the variable x

*Below is a brief explanation of these assumptions; we will revisit them in the next section when we discuss the fit of the regression line and how good the developed model is.*

**Linear relationship**

As we are fitting a linear equation to explain the relationship between y and x, it is assumed that these variables indeed have a true linear relationship.

Imagine a case (example below) where the relationship between y and x is clearly non-linear, yet we try to fit a linear equation!

Similarly, the example below shows a case where the relationship between y and x is not close to linear.

**Tests for linear relationship**

- The scatter plot, the correlation coefficient and the slope of the regression equation give an idea of the linear relationship between y and x
- A correlation coefficient close to zero indicates NO linear relationship
- A clearly +ve or –ve slope (one that differs meaningfully from zero) indicates a linear relationship
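The correlation-coefficient check is a quick one-liner with NumPy. The data below is hypothetical:

```python
import numpy as np

# Hypothetical data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson correlation coefficient: close to +1 or -1 suggests a strong
# linear relationship; close to 0 suggests no linear relationship
r = np.corrcoef(x, y)[0, 1]
```

For this data r is very close to +1, consistent with the near-perfect linear trend in the points.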

**Distribution of errors**

Errors or residuals should be spread on either side of the regression line with zero mean and a fixed variance. These residuals should also follow a normal distribution.

In cases where the distribution is non-normal, regression can still be performed (methods used in such cases are discussed in coming sections).

**Values of y at each value of x must be normally distributed**

**Variance of y at each value of x is same (homogeneity of variances)**

**Errors are independent of each other**

Errors should not increase or decrease as x increases. This means errors must be **purely random** and should not form a pattern; they should not be increasing or decreasing with time.

Errors must be purely random; if they are not random and form a pattern, that pattern should have been accommodated in our regression equation!

We can check this by plotting the residuals against the x values.
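This check can be sketched numerically as well as visually. With a least-squares fit, the residuals have zero mean and are uncorrelated with x by construction, so any remaining pattern in a residual plot points to a violated assumption. The data below is hypothetical:

```python
import numpy as np

# Hypothetical (x, y) data with an approximately linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

# Fit the line and compute residuals
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Least-squares residuals have zero mean and zero linear correlation with x;
# a residuals-vs-x plot (e.g. with matplotlib) should look like random scatter
mean_residual = residuals.mean()           # expect ≈ 0
trend = np.corrcoef(x, residuals)[0, 1]    # expect ≈ 0
```

If the residuals fanned out (growing variance) or curved systematically, the plot would reveal it even though these two summary numbers alone might not.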