Regression


Regression can be used for estimating relationships between variables and/or for prediction purposes. There are several approaches to conducting a regression, including linear and non-linear. To introduce the topic, we will discuss simple linear regression, multiple linear regression, and logistic regression.

Simple Linear Regression

Linear regression is a tool used in statistics to find a line to model past data. It is sometimes referred to as finding the line of best fit. The basic model for a simple linear regression, or a regression model containing only one input variable, is:

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

where:
  • \( \beta_0 \) is the intercept parameter
  • \( x_1 \) signifies the independent, explanatory, input, or regressor variable
  • \( \beta_1 \) is the slope parameter
  • \( y \) signifies the dependent, output, or response variable
  • and the final term \( \epsilon \) represents the random error

As an example, let’s look at the mtcars dataset in R. This dataset contains information about 11 different characteristics on 32 different cars. Let’s first save this example dataset in R as cars and then use names() to view the names of the 11 different variables.

cars <- mtcars
names(cars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

***Note: More information about this dataset can be found using help('mtcars')

Suppose we are interested in a model where the displacement of a vehicle’s engine is used to explain the variation in fuel efficiency. In other words, we're interested in using the size of the engine to explain how much fuel it needs to operate. Then our assumed regression model would be:

\[ \text{MPG} = \beta_0 + \beta_1 \times \text{displacement} + \epsilon \]

Before we continue investigating this model, there are four ‘LINE’ assumptions we need to discuss:

  1. Checking for Linearity: A linear relationship needs to exist between the response and the regressor. Check by looking at a scatterplot of x versus y.
  2. Checking for Independence: The individual errors of the model must be independent of each other. Though it is unlikely to obtain the actual errors, we can check this assumption by looking at a scatterplot of the residuals versus the order in which the observations were taken.
  3. Checking for Normality: Here we assume that the errors of the model are normally distributed. This can be investigated by looking at a Normal probability plot or a Normal Q-Q plot of the residuals in addition to a histogram of the residuals.
  4. Checking for Equal variance: We assume that the errors of the model have homogeneous variance. No matter what value the regressor takes within our modeling space, the errors of the model are equally spread out. For this assumption, we can view a scatterplot of the residuals against the predicted (or fitted) values. If met, you should see a uniform spread of all points in the direction of the residuals.

Notice that the first assumption is related to a relationship between the independent and dependent variables, while the remaining three assumptions are all related to the errors. Therefore, assuming a linear relationship exists should be verified before fitting a model to the data and estimating the parameters. If the linear assumption makes sense, then the remaining three assumptions concerning the errors need to be checked. This requires estimating the two parameters (intercept and slope) in order to come up with the residuals (differences between observed and predicted response), which give us some idea as to the characteristics of the errors given we have correctly assumed the true model. If the model meets these assumptions, we can then proceed to investigate the usefulness of this model.

Linearity Assumption

First, let's check if assuming a linear relationship makes sense:

cars <- mtcars
plot(cars$mpg, cars$disp, main = "MPG vs. Displacement",
     ylab = "Displacement", xlab = "MPG")

scatter plot of displacement versus MPG. Ranges of data are around 50 to 500 cubic inches for displacement on the vertical axis and 10 to 40 miles per gallon for fuel efficiency on the horizontal axis. Relationship appears negative and linearity possible.

Though there appears to be a possible non-linear decay of mpg as displacement increases, a negative linear relationship is also plausible.
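
As a quick numerical supplement to the scatterplot (an optional check, not a required step), the sample correlation between the two variables can be computed; a value far from zero is consistent with a linear association, though it cannot confirm linearity on its own.

cor(cars$disp, cars$mpg)  # sample correlation; a strongly negative value would agree with the plot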

As stated earlier, to check the next three assumptions we need residuals which means we need to actually fit the model. To fit the linear model, or “best fit line”, we use the lm() function.

outputlm <- lm(formula, data)

Attribute meanings:

  • formula: specifies the symbolic relationship between y and x, written as y ~ x. In our example, this is mpg ~ disp
  • data: specifies the name of the dataset which contains the variables in the formula

Therefore, for our example:

carslm <- lm(mpg ~ disp, cars)

Notice that we saved the output of the result as carslm. We did this in order to save all the information of the model into a single R object. In doing so, we can index into this object to retrieve certain information such as parameter estimates, predicted values, and residuals. In R documentation carslm would be referred to as an ‘lm’ or ‘linear model’ object, which means we can treat carslm as a list structure. We can see this by using the class() and typeof() functions in R.

 class(carslm)
## [1] "lm"

typeof(carslm)
## [1] "list"

Recall from the Basic Syntax page that we can index into the elements of a list either by using double brackets, if we know the position where our information resides, or by using the $ operator, if we know the names of the list elements. To view the names of the list elements we can use:

names(carslm)

## [1] "coefficients"  "residuals"    "effects"     "rank"
## [5] "fitted.values" "assign"       "qr"          "df.residual"
## [9] "xlevels"       "call"         "terms"       "model"


Before making the plots to check our remaining assumptions, let’s retrieve and save the necessary information in our R session environment.

 resids <- carslm$residuals     # or carslm[[2]] would also work
y_hat <- carslm$fitted.values

Independence Assumption

Recall, to check if the errors are independent we need to view a scatterplot of the residuals versus anything that could influence this assumption, like the order in which the observations were taken. We don’t really know the order in which they were taken (it’s not provided in the dataset), but for the sake of this example let’s assume the row order corresponds to when the observations were collected.

plot(resids, type = 'b')  # type = 'b' used to connect observations in row order

Scatter plot of residuals versus row order. Residuals range from around negative five to positive seven on the vertical axis for the 32 data points. There is no apparent pattern noticeable.

Based upon this plot, there do not appear to be any clear issues with independence. It's important to note that if we can assume a simple random sample was taken, this assumption will likely be met.

Normality Assumption

Next, we need to check the normality of the errors by viewing the distribution of the residuals. We can do this by looking at both the histogram and Normal Q-Q plot of the residuals.

hist(resids)

Histogram of residuals showing a slight skew to the right. Bin size is set at 2, with a clearly identifiable peak between negative four and negative two miles per gallon.

qqnorm(resids)
qqline(resids, col = 'red')  # creates red reference line

Normal quantile plot showing data points somewhat following the straight reference line for the assumed normal distribution quantiles. There may be a concern with the upper-tail residuals being larger than expected, but perhaps not a problem.

We would like the histogram to depict a symmetric, bell-shaped curve centered around zero and the points of the Normal Q-Q plot to follow the diagonal reference line as closely as possible. The residuals appear to be centered around zero, but both plots suggest that the distribution of residuals is skewed towards the positive end. This is likely related to the questionable linearity assumption of our model (recall the non-linear decay?) or to the somewhat small sample size, which suggests that the Normality assumption may be in question depending upon how knowledgeable we are about the data.
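
If a formal test is wanted alongside the plots (again, an optional supplement rather than part of the workflow above), a Shapiro-Wilk test of normality can be applied to the residuals; a small p-value would add to the concern about the Normality assumption.

shapiro.test(resids)  # null hypothesis: the residuals come from a normal distribution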

Equal Variance Assumption

Finally, we can check for equal variance of the errors by looking at the residuals versus predicted values.

plot(y_hat, resids); abline(h = 0, col = 'red')

Scatter plot of residuals versus predicted values. Residuals range from around negative four to positive seven MPG and predicted values range from around 10 MPG to 27 MPG. A pattern of positive to negative and back to positive residuals is noticeable.

We would like to see a uniform scatter of points in equal amounts above and below the horizontal reference line at zero. However, there is a slight negative and then positive trend occurring around a prediction of 20 mpg. This is also likely related to our initial (perhaps incorrect) assumption that a simple linear model using engine displacement alone is sufficient to explain the variation in fuel efficiency.
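
As an aside, R can produce standard diagnostic plots directly from an 'lm' object; the sketch below relies on the default plot.lm() method, where which = 1 gives residuals versus fitted values and which = 2 gives the Normal Q-Q plot.

par(mfcol = c(1, 2))
plot(carslm, which = 1)  # residuals versus fitted values
plot(carslm, which = 2)  # Normal Q-Q plot of standardized residuals
par(mfcol = c(1, 1))     # reset the plotting layout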

Inference and Prediction

If we deem all conditions met, we can continue with the model. One concept to investigate is the statistical significance of the regressor in explaining at least some of the variation in the response. In doing so, we are conducting what is called inference. To carry this out in R, we use summary() on our model object.

summary(carslm)

## 
## Call:
## lm(formula = mpg ~ disp, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10


Looking at the Coefficients section of the output, we can see that engine displacement is highly significant (p-value = 9.38e-10). However, keep in mind that we can confidently interpret these results in this manner only if all of the ‘LINE’ assumptions were found to have been met.
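
Interval estimates for the parameters are also available; a brief sketch using the built-in confint() function (the 95% default and the 99% alternative shown are just illustrative choices):

confint(carslm)                # 95% confidence intervals for the intercept and slope
confint(carslm, level = 0.99)  # a wider, 99% interval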

Prediction is another idea to consider. For example, say we are interested in predicting the fuel efficiency of a vehicle with an engine displacement of 192 cubic inches. This value is not contained in the dataset, so we would need to estimate a response for this particular displacement value. One approach is to use the predict.lm() function.

predict.lm(object, newdata)

Attribute meanings:

  • object: is an ‘lm’ object
  • newdata: is a dataframe consisting of the x value(s) you wish to predict for

predict.lm(carslm, data.frame(disp = 192))

##        1 
## 21.68655

Another approach is to use the estimated coefficients from carslm and simply calculate the result with the prediction equation:

\[ \widehat{\text{MPG}} = 29.6 - 0.041 \times \text{displacement} \]

b0 <- carslm$coefficients[[1]]
b1 <- carslm$coefficients[[2]]
b0 + b1 * 192

## [1] 21.68655
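
If an interval is preferred over a single predicted value, predict.lm() also accepts interval and level arguments; a sketch (disp = 192 as before, with illustrative 95% intervals):

# 95% prediction interval for a single new vehicle with disp = 192
predict.lm(carslm, data.frame(disp = 192), interval = "prediction", level = 0.95)

# 95% confidence interval for the mean mpg of vehicles with disp = 192
predict.lm(carslm, data.frame(disp = 192), interval = "confidence", level = 0.95)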

Multiple Linear Regression

Multiple linear regression is an extension of the simple case in that now the model may contain multiple regressors.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \epsilon \]

The idea is that each regressor added accounts for more of the variation in the response. However, there is a point where too many regressors have been added to the model; this is known as overfitting. When we've overfit a model, we have made the equation more complex than it needs to be, which can have negative consequences when conducting inference.
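
One way to guard against overfitting is to compare fit measures that penalize added complexity, such as Adjusted R-squared or AIC, as regressors are added; the sketch below uses mtcars variables chosen purely for illustration.

fit1 <- lm(mpg ~ disp, data = mtcars)
fit2 <- lm(mpg ~ disp + wt, data = mtcars)
fit3 <- lm(mpg ~ disp + wt + hp, data = mtcars)

# Adjusted R-squared and AIC both penalize unnecessary regressors
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared
summary(fit3)$adj.r.squared
AIC(fit1, fit2, fit3)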

The mtcars dataset will again be used as an example dataset. This time, we will investigate using number of engine cylinders cyl, vehicle weight wt, and size of carburetor carb to explain the variation in horsepower hp.

The initial model we might consider could be the one containing all three regressors:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \]

or

\[ \text{horsepower} = \beta_0 + \beta_1 \times \text{cyl} + \beta_2 \times \text{wt} + \beta_3 \times \text{carb} + \epsilon \]

Similar to the simple linear model case, we use the lm() function to produce our model object. However, notice that the formula argument now includes three regressors: cyl + wt + carb is regressed against hp.

cars <- mtcars
lm_CylWtCarb <- lm(hp ~ cyl + wt + carb, data = cars)

Also, as in the simple model case, we need to check our assumptions:

# Linear relationship of hp with each of the three regressors
hp <- cars$hp
cyl <- cars$cyl
wt <- cars$wt
carb <- cars$carb
par(mfcol = c(1,3))  # create a 1-by-3 array to hold all 3 plots together
plot(cyl, hp)
plot(wt, hp)
plot(carb, hp)

Three separate scatter plots with horsepower on the vertical axis and each of the three regressor variables on the horizontal axis. Number of cylinders is depicted as either four, six, or eight with a positive linear association; weight of a vehicle also shows a positive association with horsepower and linearity is plausible; finally, carburetor size ranges from one to eight barrels and also shows a positive association with horsepower, with linearity plausible.
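
Rather than building each scatterplot separately, the same visual check can be made with a scatterplot matrix; a quick alternative sketch using base R:

pairs(cars[, c("hp", "cyl", "wt", "carb")])  # hp plotted against each candidate regressor (and the regressors against each other)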

***Note: For the independence of the errors, we'll assume independence as before on the premise that a simple random sample was taken.

# Normality of Errors
resids <- lm_CylWtCarb$residuals
par(mfcol = c(1,2))
hist(resids)
qqnorm(resids); qqline(resids, col = 'red')

Side by side depiction of histogram and normal quantile plot of residuals. The histogram appears unimodal and fairly symmetric, with residuals ranging from negative 60 to positive 60 horsepower and a peak between zero and twenty horsepower. All data points follow the reference line closely in the normal quantile plot.

# Equal Variance of Errors
hp_hat <- lm_CylWtCarb$fitted.values
plot(hp_hat, resids); abline(h = 0, col = 'red')

Scatter plot of residuals versus predicted horsepower. Though randomly scattered for the most part, the residuals show a possible increase in variation as predicted values increase. However, overall there does not appear to be any serious issue.

Next, let’s assume we are interested in seeing if all three factors (i.e., regressor variables) are statistically significant when in the same model.

summary(lm_CylWtCarb)

## 
## Call:
## lm(formula = hp ~ cyl + wt + carb, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.002 -20.308   1.087  21.443  53.590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -48.6395    20.0553  -2.425  0.02200 *  
## cyl          23.1827     5.1573   4.495  0.00011 ***
## wt            0.1441     8.8500   0.016  0.98712    
## carb         18.2828     3.9278   4.655 7.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.01 on 28 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.8084 
## F-statistic: 44.61 on 3 and 28 DF,  p-value: 8.546e-11

We can see that both cyl and carb are significant even at an α-level of 0.001; however, the weight of a vehicle, wt, does not appear to explain a significant amount of the variation in horsepower when included in a model containing the other two regressors (p-value = 0.98712).

Therefore, we should consider fitting a reduced model. That is, where wt is not included.

lm_CylCarb <- lm(hp ~ cyl + carb, cars)
summary(lm_CylCarb)

## 
## Call:
## lm(formula = hp ~ cyl + carb, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.04 -20.23   1.04  21.61  53.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -48.558     19.081  -2.545   0.0165 *  
## cyl           23.244      3.489   6.662 2.64e-07 ***
## carb          18.285      3.858   4.740 5.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.49 on 29 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.8151 
## F-statistic: 69.31 on 2 and 29 DF,  p-value: 8.959e-12


Looking at the output, both cyl and carb are statistically significant with p-values < 0.0001. Comparing other model statistics (e.g., the residual standard error is lower and the Adjusted R-squared is higher for the reduced model), we also see that removing wt improves, or at least does not degrade, the quality of the fit. Also notice that the parameter estimates for the remaining two regressors barely change (\( \beta_{\text{cyl}} \) from 23.183 to 23.244 and \( \beta_{\text{carb}} \) from 18.283 to 18.285), which is further evidence that we were likely overfitting the model when we unnecessarily included the weight variable.
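
A more formal comparison of the two nested models is the partial F-test, which tests whether wt adds significant explanatory power beyond cyl and carb; a sketch using the built-in anova() method for 'lm' objects:

anova(lm_CylCarb, lm_CylWtCarb)  # partial F-test: reduced model versus full model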

***Note: This does NOT mean that weight alone isn't significant in explaining the variation in horsepower for vehicles in this dataset! In fact, if you fit the simple linear regression model containing only wt, you would see that it is highly significant. Instead, the result above illustrates that once we know how many cylinders and what size of carburetor an engine has, knowing the weight of the car does not add significant explanatory power for the variation in horsepower (at least statistically speaking; whether or not it is of practical significance is another story).
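
To see this for yourself, the simple model containing only wt can be fit and summarized; a quick sketch (check the p-value on the wt coefficient):

summary(lm(hp ~ wt, data = cars))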

In addition to the standard output that summary() provides, there are many other statistical tools we can use to further compare different multiple linear regression models for this dataset. Model diagnostics, variable selection methods, checks for multicollinearity issues, use of information criteria, and cross validation are just a few of these other tools. Also, keep in mind that if we actually favored this model over the initial one, we would still need to check the model assumptions, because the reduced model is a different model. Can you verify that the reduced model

\[ \widehat{\text{horsepower}} = -48.56 + 23.24 \times \text{cyl} + 18.29 \times \text{carb} \]

meets all the ‘LINE’ assumptions? Try it on your own!
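
As a starting point for that exercise, the sketch below mirrors the earlier assumption-checking workflow for the reduced model (it is only a scaffold; interpreting the plots is up to you):

resids_red <- lm_CylCarb$residuals
hp_hat_red <- lm_CylCarb$fitted.values

par(mfcol = c(1, 3))
hist(resids_red)                                          # Normality
qqnorm(resids_red); qqline(resids_red, col = 'red')       # Normality
plot(hp_hat_red, resids_red); abline(h = 0, col = 'red')  # Equal variance
par(mfcol = c(1, 1))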

Logistic Regression

Recall the ‘LINE’ assumptions.

  • Linear relationship
  • Independent errors
  • Normally distributed errors
  • Equal variance of errors

However, some relationships by their very nature violate these assumptions. In these cases it is preferable to apply an appropriate modeling technique that acknowledges the true nature of the response. For example, consider a survey about whether a political figure will win an upcoming election.

The outcome (i.e., the response variable y) is not strictly quantitative. Rather, it is binary: the outcome is one of only two options, “win” or “lose.” We can code these outcomes as 1 and 0 and model the probability of success or failure.
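
Although the response is binary, the model itself is still linear on a transformed scale: with a single regressor, logistic regression models the log-odds of success as

\[ \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 \]

where \( p \) is the probability that the response equals 1 (a “win” in this example).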

In these types of situations we use what is called a Generalized Linear Model (GLM). In R, GLMs are fit using the function glm().

glm(formula, family = gaussian, data,...)

Attribute meanings:

  • formula: a symbolic description of the model to be fit
  • family: name of the assumed error distribution and link function
  • data: dataframe, list, or environment containing the variables in the model

Note: For details about other arguments of glm() type help('glm') in your R console.

Example:
Let’s consider an experiment that investigates the survival of 20 tadpoles when exposed to different levels of ultraviolet light:
0% (low), 1%, 9%, 16%, 23%, 31%, 40%, 57%, 78%, and 100% (high).

The survival status is recorded by the variable Survived, where (0 = died, 1 = survived).

First, we will define the two variables and then create the dataset Exposure.

UV.Level <- c(0.01, 1.00, 0.23, 0.00, 0.16, 0.16, 0.09, 0.23, 0.01, 0.00,
              0.57, 0.16, 1.00, 0.01, 1.00, 0.40, 0.78, 0.23, 0.40, 0.00)
Survived <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)
Exposure <- data.frame(UV.Level, Survived)


Knowing that Survived is binary data, let’s create a plot to better visualize what this means:

plot(Survived ~ UV.Level, data = Exposure)

Scatter plot of UV level versus a zero or one indicator of death or survival, respectively. With UV levels ranging from zero to one on the horizontal axis, all values for survival are zero except for two observations where UV levels were just above 0.2 units and close to 0.4 units.

From a visual inspection of the plot, it appears that deaths occurred across a wide variety of UV levels, while only a couple of tadpoles survived, both at lower levels of UV exposure. This dataset therefore suggests that UV level is unlikely to have a significant relationship with the probability of survival.

Next, let’s use the glm() function to model the probability of survival as a function of UV level and use the resulting inference to guide our decision.

glmFit <- glm(Survived ~ UV.Level, data = Exposure, family = binomial)
summary(glmFit)

##
## Call:
## glm(formula = Survived ~ UV.Level, family = binomial, data = Exposure)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4636  -0.4634  -0.4608  -0.4518   2.1482  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.17657    1.01039  -2.154   0.0312 *
## UV.Level    -0.06476    2.16039  -0.030   0.9761  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13.003  on 19  degrees of freedom
## Residual deviance: 13.002  on 18  degrees of freedom
## AIC: 17.002
## 
## Number of Fisher Scoring iterations: 4

The results seem to support our initial visual inspection of the data. With the parameter estimate for UV level having a p-value = 0.9761, the effect appears insignificant in explaining the variation in the probability of survival.
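
Had UV level turned out to be useful, fitted probabilities of survival could be obtained directly from the 'glm' object; the sketch below shows how (the new UV level of 0.5 is just an illustrative value):

# Predicted probability of survival at each observed UV level
predict(glmFit, type = "response")

# Predicted probability of survival at a hypothetical UV level of 0.5
predict(glmFit, newdata = data.frame(UV.Level = 0.5), type = "response")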
