Multiple Regression

April 23, 2024

Load packages

library(tidyverse)
library(tidymodels)
library(vdemdata)

Load VDEM Data

modelData <- vdem |>
  filter(year == 2006) |> 
  select(country_name, 
         libdem = v2x_libdem, 
         wealth = e_gdppc, 
         oil_rents = e_total_oil_income_pc,
         polarization = v2cacamps, 
         corruption = v2x_corr, 
         judicial_review = v2jureview_ord, 
         region = e_regionpol_6C, 
         regime = v2x_regime) |> 
  mutate(log_wealth = log(wealth), 
         region = factor(
           region,
           labels=c("Eastern Europe", 
             "Latin America", 
             "MENA", 
             "SSAfrica", 
             "Western Europe and North America", 
             "Asia and Pacific"))
          )

glimpse(modelData)

Rows: 177
Columns: 10
$ country_name    <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana"…
$ libdem          <dbl> 0.482, 0.672, 0.888, 0.845, 0.640, 0.659, 0.768, 0.018…
$ wealth          <dbl> 14.584, 10.008, 42.378, 48.410, 3.284, 10.580, 37.068,…
$ oil_rents       <dbl> 694.847, 639.506, 0.000, 0.000, 6.380, 10.060, 2.635, …
$ polarization    <dbl> -0.346, -1.748, -2.346, -1.711, -0.398, -0.029, -2.273…
$ corruption      <dbl> 0.611, 0.226, 0.004, 0.023, 0.640, 0.400, 0.107, 0.892…
$ judicial_review <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
$ region          <fct> Latin America, Latin America, Western Europe and North…
$ regime          <dbl> 2, 2, 3, 3, 3, 3, 3, 0, 1, 2, 1, 1, 2, 3, 2, 3, 3, 2, …
$ log_wealth      <dbl> 2.6799250, 2.3033848, 3.7466294, 3.8797064, 1.1890622,…

Linear Model with Multiple Predictors

Previously, we were interested in GDP per capita as a predictor of democracy
Now, let’s consider another predictor: polarization (also measured by V-Dem)

Polarization Measure in the USA

Polarization and Democracy in 2006

Model with One Predictor

linear_reg() |>
  set_engine("lm") |>
  fit(libdem ~ polarization, data = modelData) |> 
  tidy()

# A tibble: 2 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.383     0.0187     20.5  2.68e-48
2 polarization  -0.0882    0.0140     -6.29 2.45e- 9

How do we interpret the intercept and the slope estimate?
Significance?

Model with Two Predictors

linear_reg() |>
  set_engine("lm") |>
  fit(libdem ~ polarization + log_wealth, data = modelData) |> 
  tidy()

# A tibble: 3 × 5
  term         estimate std.error statistic       p.value
  <chr>           <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)    0.200     0.0344      5.80 0.0000000318 
2 polarization  -0.0550    0.0139     -3.96 0.000111     
3 log_wealth     0.0952    0.0152      6.24 0.00000000334

Model with Two Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc\]

\[\hat{Y_i} = 0.18 + -0.05*Polarization + 0.10*GDPpc\]

Model with Two Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc\]

\[\hat{Y_i} = 0.18 + -0.05*Polarization + 0.10*GDPpc\]

\(a\) is the predicted level of Y when BOTH GDP per capita and polarization are equal to 0

Model with Two Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc\]

\[\hat{Y_i} = 0.18 + -0.05*Polarization + 0.10*GDPpc\]

\(b_1\) is the impact of a 1-unit change in polarization on the predicted level of Y, holding GDP per capita fixed (all else equal)
The relationship between polarization and democracy, controlling for wealth

Model with Two Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc\]

\[\hat{Y_i} = 0.18 + -0.05*Polarization + 0.10*GDPpc\]

\(b_2\) is the impact of a 1-unit change in GDP per capita on the predicted level of Y, holding polarization fixed (all else equal)
The relationship between wealth and democracy, controlling for polarization

Model with Two Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc\]

OLS is searching for combination of \(a\), \(b_1\), and \(b_2\) that minimize sum of squared residuals
Same logic as model with one predictor, just more complicated

Model with Three Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc + b_3*OilRents\]

Model with Three Predictors

linear_reg() |>
  set_engine("lm") |>
  fit(libdem ~ polarization + log_wealth + oil_rents, data = modelData) |> 
  tidy()

# A tibble: 4 × 5
  term           estimate  std.error statistic  p.value
  <chr>             <dbl>      <dbl>     <dbl>    <dbl>
1 (Intercept)   0.157     0.0322          4.86 2.78e- 6
2 polarization -0.0538    0.0130         -4.13 5.87e- 5
3 log_wealth    0.132     0.0152          8.67 4.76e-15
4 oil_rents    -0.0000412 0.00000615     -6.70 3.39e-10

Model with Three Predictors

\[\hat{Y_i} = a + b_1*Polarization + b_2*GDPpc + b_3*OilRents\]

\[\hat{Y_i} = a + -.05*Polarization + .13*GDPpc + .00004*OilRents\]

\(b_3\) is the impact of a 1-unit change in oil revenues per capita on the predicted level of Y, holding polarization and GDP fixed (all else equal)

Your Turn!

In the last session, you examined levels of democracy and corruption
Now, let’s fit a multiple regression model predicting corruption with two predictors: democracy (libdem) and polarization (polarization)
Then, interpret the coefficients
Estimate a second multiple regression model that adds in GDP per capita (lg_gdppc)
Interpret the coefficients
What happens to the coefficients of the other variables when we add GDP per capita to the model?
- Why do you think this happens?
Try adding in additional predictors if there is time

10:00

Model 1

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.846     0.0268     31.5  7.42e-74
2 libdem        -0.760     0.0588    -12.9  3.42e-27
3 polarization   0.0495    0.0121      4.10 6.33e- 5

Model 2 - Adding GDP per capita

# A tibble: 4 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.954     0.0283     33.8  5.21e-77
2 libdem        -0.602     0.0576    -10.5  4.59e-20
3 polarization   0.0332    0.0109      3.04 2.71e- 3
4 log_wealth    -0.0846    0.0127     -6.67 3.49e-10

Categorical Variables

Judicial Review and Democracy

Judicial Review:

Do high courts (Supreme Court, Constitutional Court, etc) have the power to rule on whether laws or policies are constitutional/legal? (Yes or No)
Dimension of Judicial Independence

Judicial Review and Democracy

linear_reg() |>
  set_engine("lm") |>
  fit(libdem ~ factor(judicial_review), data = modelData) |>
  tidy()

# A tibble: 2 × 5
  term                     estimate std.error statistic     p.value
  <chr>                       <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)                 0.176    0.0480      3.67 0.000323   
2 factor(judicial_review)1    0.275    0.0521      5.27 0.000000390

Judicial Review and Democracy

\[\widehat{Democracy_{i}} = 0.17 + 0.28*JudicialReview(yes)\]

Slope: countries with judicial review are expected, on average, to be 0.28 units more democratic on the liberal democracy index
- Compares baseline level (Judicial Review = 0) to the other level (Judicial Review = 1)
Intercept: average democracy score of countries without judicial review
Average democracy score of countries with judicial review is 0.17 + 0.28 = 0.45

Dummy Variables

When the categorical explanatory variable has many levels, they’re encoded to dummy variables
We always leave one category out of the model, as the omitted reference category
Each coefficient describes is the expected difference between level of the factor and the baseline level
Everything is relative to the omitted reference category

Democracy and World Region

Does region predict levels of democracy?
Since Eastern Europe is the first category, default in R is to use that as the omitted category in models.

levels(modelData$region)

[1] "Eastern Europe"                   "Latin America"                   
[3] "MENA"                             "SSAfrica"                        
[5] "Western Europe and North America" "Asia and Pacific"

Democracy and World Region

How should we interpret intercept? How about the coefficient on Latin America?

linear_reg() |>
  set_engine("lm") |>
  fit(libdem ~ region, data = modelData) |>
  tidy()

# A tibble: 6 × 5
  term                                   estimate std.error statistic  p.value
  <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                              0.436     0.0362     12.0  1.54e-24
2 regionLatin America                      0.0685    0.0537      1.28 2.04e- 1
3 regionMENA                              -0.235     0.0573     -4.11 6.16e- 5
4 regionSSAfrica                          -0.138     0.0458     -3.00 3.07e- 3
5 regionWestern Europe and North America   0.373     0.0543      6.87 1.15e-10
6 regionAsia and Pacific                  -0.134     0.0521     -2.57 1.10e- 2

Democracy and World Region

What if you want a different baseline category? How do we interpret now?

# make SS Africa the reference category
modelData <- modelData |> 
mutate(newReg = relevel(region, ref=4)) 

linear_reg() |>
      set_engine("lm") |>
      fit(libdem ~ newReg, data = modelData) |>
      tidy()

# A tibble: 6 × 5
  term                                   estimate std.error statistic  p.value
  <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                             0.298      0.0281   10.6    1.44e-20
2 newRegEastern Europe                    0.138      0.0458    3.00   3.07e- 3
3 newRegLatin America                     0.206      0.0486    4.24   3.63e- 5
4 newRegMENA                             -0.0977     0.0525   -1.86   6.44e- 2
5 newRegWestern Europe and North America  0.511      0.0493   10.4    7.57e-20
6 newRegAsia and Pacific                  0.00351    0.0468    0.0750 9.40e- 1

Your Turn

Which types of regime have more corruption?

V-Dem also includes a categorial regime variable: Closed autocracy (0), Electoral Autocracy (1), Electoral Democracy (2), Liberal Democracy (3)

Your Turn

Which types of regime have more corruption?

First, let’s make this an easier factor variable to work with.

# Make nicer regime factor variable
modelData <- modelData |> 
  mutate(regime = factor(regime,
                         labels = c("Closed Autocracy",
                                   "Electoral Autocracy",
                                  "Electoral Democracy",
                                  "Liberal Democracy")))
levels(modelData$regime)

[1] "Closed Autocracy"    "Electoral Autocracy" "Electoral Democracy"
[4] "Liberal Democracy"

Your Turn

Which types of regime have more corruption?

Filter data to include only the year 2019 (or run the code to use modelData)
Make a plot to visualize the relationship between regime type and level of corruption. - Which kind of plot is best in this situation?
Fit a linear model
Interpret the intercept and coefficients

10:00

Visualization

Model

# A tibble: 4 × 5
  term                      estimate std.error statistic  p.value
  <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                 0.597     0.0395     15.1  1.93e-33
2 regimeElectoral Autocracy   0.148     0.0470      3.14 1.99e- 3
3 regimeElectoral Democracy  -0.0580    0.0486     -1.19 2.35e- 1
4 regimeLiberal Democracy    -0.459     0.0497     -9.23 9.63e-17

Create Your Own Model

What is a theory that you would like to test with V-Dem data?
What is the dependent variable?
What are the independent variables?
Map out steps to wrangle the data and fit a regression model
What do you expect to find?
Now go ahead and wrangle the data
Fit the model
Interpret the coefficients and their significance
Did the results match your expectations?