Module 5.2

Multiple Regression

Overview

In the previous lesson, we discussed simple linear regression, which is a statistical method that allows us to model the relationship between a dependent variable and a single independent variable. In this lesson, we will discuss multiple linear regression, which is an extension of simple linear regression that allows us to model the relationship between a dependent variable and multiple independent variables. To get you started, here is a video where Andrew Ng explains the logic of multiple linear regression.

Multiple Regression

Mulitple linear regression is an extension of simple linear regression that allows us to model the relationship between a dependent variable and multiple independent variables. Extending on our earlier bivariate model, we can now include multiple predictors in our model. We would represent this in the following equation:

\[Y = a + b_1X_1 + b_2X_2 + \ldots + b_nX_n\]

Where \(Y\) is the dependent variable, \(a\) is the intercept, \(b_1\) through \(b_n\) are the coefficients, and \(X_1\) through \(X_n\) are the independent variables.

To solve this model, we use the same method of least squares to minimize the sum of squared residuals, except in this case, OLS is searching for combination of \(a\), \(b_1\), \(b_2\)…\(b_n\) that minimize sum of squared residuals across all predictors. So it is the same logic, just a bit more complicated.

To make things a little more concrete, let’s build up an example using the V-Dem data. Let’s first load the relevant packages and grab the data from V-Dem. We will use data from 2006 because we want a nice cross-section and it has data for all of the variables that we will be using in our analysis.

library(tidyverse)
library(vdemdata)

model_data <- vdem |>
  filter(year == 2006) |> 
  select(country_name, 
         libdem = v2x_libdem, 
         wealth = e_gdppc, 
         oil_rents = e_total_oil_income_pc,
         polarization = v2cacamps, 
         corruption = v2x_corr, 
         judicial_review = v2jureview_ord, 
         region = e_regionpol_6C, 
         regime = v2x_regime) |> 
  mutate(log_wealth = log(wealth), 
         region = factor(
           region,
           labels=c("Eastern Europe", 
             "Latin America", 
             "MENA", 
             "SSAfrica", 
             "Western Europe and North America", 
             "Asia and Pacific"))
          )

glimpse(model_data)

Rows: 177
Columns: 10
$ country_name    <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana"…
$ libdem          <dbl> 0.480, 0.655, 0.882, 0.842, 0.633, 0.652, 0.776, 0.017…
$ wealth          <dbl> 14.584, 10.008, 42.378, 48.410, 3.284, 10.580, 37.068,…
$ oil_rents       <dbl> 694.847, 639.506, 0.000, 0.000, 6.380, 10.060, 2.635, …
$ polarization    <dbl> 0.102, -1.783, -2.254, -1.732, -0.615, 0.065, -2.261, …
$ corruption      <dbl> 0.608, 0.212, 0.004, 0.023, 0.630, 0.423, 0.108, 0.888…
$ judicial_review <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
$ region          <fct> Latin America, Latin America, Western Europe and North…
$ regime          <dbl> 2, 2, 3, 3, 3, 3, 3, 0, 1, 2, 1, 1, 2, 3, 2, 3, 3, 2, …
$ log_wealth      <dbl> 2.6799250, 2.3033848, 3.7466294, 3.8797064, 1.1890622,…

Previously, we were only interested in GDP per capita as a predictor of democracy. Now, let’s consider another predictor: political polarization (also measured by V-Dem). We can start with a simple scatter plot and a linear model to see the relationship between polarization and democracy cross-nationally.

ggplot(model_data, aes(x = polarization, y = libdem)) +
  geom_point() +
  geom_smooth(method = "lm", color = "#E48957", se = FALSE) +
  labs(x = "Polarization", y = "Liberal Democracy Index") +
  theme_minimal()

We can see that there appears to be quite a strong relationship between political polarization and democracy. Now, as with GDP per capita in the previous lesson, we can estimate the relationship between political polarization and democracy using a linear model to get the coefficients that define the line we see in the plot above. We want to estimate this equation:

\[lib\_dem = a + b \times polarization\]

model1 <-  lm(libdem ~ polarization, data = model_data) 

summary(model1)


Call:
lm(formula = libdem ~ polarization, data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59898 -0.16888 -0.02296  0.19157  0.55636 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.37588    0.01871  20.086  < 2e-16 ***
polarization -0.09137    0.01389  -6.579  5.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2405 on 175 degrees of freedom
Multiple R-squared:  0.1983,    Adjusted R-squared:  0.1937 
F-statistic: 43.28 on 1 and 175 DF,  p-value: 5.299e-10

We see that coefficient for polarization is -.088 and significant at the .0001 level. This means that for every one unit increase in polarization we should expect to see a .088 decrease in the liberay democracy score.

Now we can just add predictors to our model. Let’s add log wealth back in and see how it changes the relationship between polarization and democracy. The equation we are estimating here is:

\[lib\_dem = a + b_1 \times polarization + b_2 \times log(wealth)\]

model2 <- lm(libdem ~ polarization + log_wealth, data = model_data) 

summary(model2)


Call:
lm(formula = libdem ~ polarization + log_wealth, data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65363 -0.16094  0.04912  0.18077  0.42814 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.19571    0.03399   5.758 3.90e-08 ***
polarization -0.05837    0.01376  -4.242 3.63e-05 ***
log_wealth    0.09439    0.01512   6.242 3.34e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2196 on 170 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.347, Adjusted R-squared:  0.3393 
F-statistic: 45.17 on 2 and 170 DF,  p-value: < 2.2e-16

Now we can interpret the coefficients the same way as before, except we are doing so in the context of “all things being equal” or “holding other variables constant.” In this case, the coefficient for polarization is -.05 and again statistically significant. Thus we would say that holding GDP per capita fixed, a one unit change in political polaraization has a -.05 impact on the predicted level of democracy. Similarly, we could say that, holding polarization fixed, a one unit change in log GDP per capita increase the predicted democracy score by 0.095 units.

Now if we want to add yet another predictor, we can do so in the same fashion. Let’s say we are interested in the impact of oil rents on democracy. We can add this to our model and estimate the following equation:

\[lib\_dem = a + b_1 \times polarization + b_2 \times log(wealth) + b_3 \times oil\_rents\]

model3 <- lm(libdem ~ polarization + log_wealth + oil_rents, data = model_data)

summary(model3)


Call:
lm(formula = libdem ~ polarization + log_wealth + oil_rents, 
    data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54170 -0.14140  0.03997  0.13419  0.67472 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.531e-01  3.174e-02   4.823 3.30e-06 ***
polarization -5.774e-02  1.283e-02  -4.499 1.32e-05 ***
log_wealth    1.309e-01  1.501e-02   8.720 3.65e-15 ***
oil_rents    -4.133e-05  6.074e-06  -6.805 1.98e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1973 on 158 degrees of freedom
  (15 observations deleted due to missingness)
Multiple R-squared:  0.486, Adjusted R-squared:  0.4763 
F-statistic:  49.8 on 3 and 158 DF,  p-value: < 2.2e-16

Here the oil rents variable is negative and significant, suggesting that holding polarization and GDP fixed, a one unit change in oil rents per capita has a -.00004 unit impact on the predicted level of democracy. We would have to go back to the oil rents variable to understand the significance of this finding, but this gives you a basic sense of how to interpret the coefficients in a multiple regression model.

Categorical Variables

Another topic we want to cover is how to include categorical variables in a regression model. Recall that categorical variables are variables that take on a limited number of values. For example, whether a country has judicial review or not can be coded as a categorical variable. We can turn that variable into a dummy variable and include it in our model.

Let’s quickly visualize the relationship between wealth and democracy, incorporating judicial review as a factor variable that conditions the relationship between wealth and democracy. Note how we are using the factor() function to turn the judicial_review variable into a factor variable.

ggplot(model_data, aes(x = log_wealth, y = libdem, color=factor(judicial_review))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "GPD per capita", y = "Liberal Democracy Index") +
  theme_bw() +
    scale_color_manual(name = "Judicial Review", values = c("steelblue3", "coral"), labels = c("No", "Yes"))

We can see that the relationship between GDP per capita and democracy is different for countries with and without judicial review. We can estimate this relationship using a linear model. We do this by simply incorporating the categorical variable, judicial_review, into our model. Here is the equation we want to estimate:

\[lib\_dem = a + b_1 \times log(wealth) + b_2 \times judicial\_review\]

model4 <- lm(libdem ~ log_wealth + judicial_review, data = model_data)

summary(model4)


Call:
lm(formula = libdem ~ log_wealth + judicial_review, data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65748 -0.13532  0.01481  0.15409  0.48032 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.05389    0.04919  -1.096    0.275    
log_wealth       0.11709    0.01341   8.734 2.25e-15 ***
judicial_review  0.26304    0.04481   5.870 2.23e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2106 on 170 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.3996,    Adjusted R-squared:  0.3925 
F-statistic: 56.56 on 2 and 170 DF,  p-value: < 2.2e-16

We can now interpret the coefficient on judicial_review as the expected difference in the predicted level of democracy between countries with and without judicial review, holding GDP per capita fixed. In this case, the coefficient is .26, suggesting that holding GDP per capita fixed, countries with judicial review are expected to be .26 units more democratic as per V-Dem’s liberal democracy measure.

Note that the intercept also has an important interpretation. The intercept is the average democracy score of countries without judicial review. Thus, the average democracy score of countries with judicial review is -.05 + .12 + .26 = .33 and the average democracy score of countries without judicial review is -.05 + .12 = .07.

The judical review example here is an example of a dummy variable. Dummy variables are a way to include categorical variables in a regression model. Note that we always leave one category out of the model, as the omitted reference category. The coefficient on a dummy variable is then interpreted as the expected difference in the dependent variable between the category of the dummy variable and the omitted category. In this case, the omitted category is countries without judicial review.

Let’s try one more example. We will use the region variable to estimate the relationship between regions of the world and democracy. region will enter into the model as a factor variable. Noee that since Eastern Europe is the first category, default in R is to use that as the omitted category in models.

levels(model_data$region)

[1] "Eastern Europe"                   "Latin America"                   
[3] "MENA"                             "SSAfrica"                        
[5] "Western Europe and North America" "Asia and Pacific"

This means that when we run the regression, we will be interpreting the other region coefficients as the expected difference in the predicted level of democracy between that region and Eastern Europe, holding all other variables constant. And the coefficient will represent the value of the dependent variable for the omitted category (Eastern Europe).

model5 <- lm(libdem ~ region, data = model_data)

summary(model5)


Call:
lm(formula = libdem ~ region, data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45792 -0.13186 -0.02004  0.11414  0.49320 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             0.43450    0.03608  12.042  < 2e-16 ***
regionLatin America                     0.06642    0.05352   1.241  0.21629    
regionMENA                             -0.23570    0.05705  -4.131 5.63e-05 ***
regionSSAfrica                         -0.13946    0.04564  -3.056  0.00261 ** 
regionWestern Europe and North America  0.37554    0.05412   6.939 7.84e-11 ***
regionAsia and Pacific                 -0.13364    0.05193  -2.573  0.01092 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1976 on 171 degrees of freedom
Multiple R-squared:  0.4712,    Adjusted R-squared:  0.4557 
F-statistic: 30.47 on 5 and 171 DF,  p-value: < 2.2e-16

Here the coefficient on Latin America is .06. This entails that, holding all other variables constant, countries in Latin America are expected to be .06 units more democratic than countries in Eastern Europe. Similarly countries in the MENA region are expected to be -.23 units less democratic than countries in Eastern Europe. Etc. And finally the predicted level of democracy in Eastern Europe is represented by the value of the coefficient, .44.

What if you want a different baseline category? In that case, we can use the relevel() function to change the reference category. Let’s make Sub-Saharan Africa the reference category and re-estimate the model.

# make SS Africa the reference category
model_data_relevel <- model_data |> 
mutate(new_region = relevel(region, ref=4)) 

model6 <- lm(libdem ~ new_region, data = model_data_relevel)

summary(model6)


Call:
lm(formula = libdem ~ new_region, data = model_data_relevel)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45792 -0.13186 -0.02004  0.11414  0.49320 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                 0.295040   0.027950  10.556
new_regionEastern Europe                    0.139460   0.045641   3.056
new_regionLatin America                     0.205880   0.048410   4.253
new_regionMENA                             -0.096240   0.052289  -1.841
new_regionWestern Europe and North America  0.515002   0.049078  10.494
new_regionAsia and Pacific                  0.005817   0.046649   0.125
                                           Pr(>|t|)    
(Intercept)                                 < 2e-16 ***
new_regionEastern Europe                    0.00261 ** 
new_regionLatin America                    3.47e-05 ***
new_regionMENA                              0.06742 .  
new_regionWestern Europe and North America  < 2e-16 ***
new_regionAsia and Pacific                  0.90091    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1976 on 171 degrees of freedom
Multiple R-squared:  0.4712,    Adjusted R-squared:  0.4557 
F-statistic: 30.47 on 5 and 171 DF,  p-value: < 2.2e-16

Now we interpret all of the coefficients relative to Sub-Saharan Africa. For example, Eastern Europe is expected to be .138 units more democratic than Sub-Saharan Africa while Latin America is predicted to be .206 units higher. The predicted level of democracy in Sub-Saharan Africa for this year is represented by the constant–.298.

Challenge

The data we wrangled also includes a categorial regime variable: Closed autocracy (0), Electoral Autocracy (1), Electoral Democracy (2), Liberal Democracy (3) as well as a measure of corruption. Based on what you have learned, do you think you can use these data to answer the question of “which types of regimes have more corruption?”