Modeling

April 23, 2024

Modeling

Use models to explain the relationship between variables and to make predictions
Explaining relationships [usually interested in causal relationships, but not always]
- Does oil wealth impact regime type?
Predictive modeling
- Where is violence most likely to happen in [country X] during their next election?
- Is this email spam?

Modeling

Example: GDP per capita and Democracy

Pull in the VDEM Data

What is this code doing?

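library(tidyverse) # provides filter(), select(), mutate(), glimpse(), and ggplot()
library(vdemdata)  # provides the vdem data frame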

modelData <- vdem |>
  filter(year == 2019) |> 
  select(
    country = country_name, 
    lib_dem = v2x_libdem, 
    wealth = e_gdppc) |>
  mutate(log_wealth = log(wealth))

glimpse(modelData)

Rows: 179
Columns: 4
$ country    <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana", "So…
$ lib_dem    <dbl> 0.433, 0.593, 0.875, 0.870, 0.614, 0.601, 0.754, 0.267, 0.1…
$ wealth     <dbl> 16.814, 11.752, 48.804, 56.110, 5.608, 11.345, 39.061, 5.69…
$ log_wealth <dbl> 2.8222119, 2.4640234, 3.8878123, 4.0273140, 1.7241941, 2.42…
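
One way to check what the `mutate()` step does: `log()` in R is the natural logarithm by default, so `log_wealth` should equal the natural log of `wealth`. The sketch below is only a quick check, using the first three `wealth` values copied from the output above; the names `wealth_sample` and `log_sample` are made up for illustration.

```r
# First three wealth values copied from the glimpse() output above
wealth_sample <- c(16.814, 11.752, 48.804)

# log() with no base argument is the natural logarithm
log_sample <- log(wealth_sample)

# Compare with the first three log_wealth values in the output
round(log_sample, 4)
```

Rounded to four decimals, the results should line up with the first three `log_wealth` values shown in the output.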

Plot the Relationship

ggplot(modelData, aes(x = wealth, y = lib_dem)) +
  geom_point() +
  geom_smooth(method = "lm", color = "#E48957", se = FALSE) +
  labs(x = "GDP per capita", y = "Liberal Democracy Index") +
  theme_bw()

Using the Scales Package

ggplot(modelData, aes(x = wealth, y = lib_dem)) +
  geom_point() +
  geom_smooth(method = "lm", color = "#E48957", se = FALSE) +
  scale_x_log10(labels = scales::label_dollar(suffix = "k")) +
  labs(
    title = "Wealth and Democracy, 2019",
    x = "GDP per capita",
    y = "Liberal Democracy Index") +
  theme_bw()

Models as Functions

We can represent relationships between variables using functions
A function is a mathematical concept: the relationship between an output and one or more inputs
- Plug in the inputs and receive back the output
Example: The formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\).
- If \(x\) is \(5\), \(y\) is \(22\):
- \(y = 3 \times 5 + 7 = 22\)
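
The same idea can be sketched in R, where a function takes inputs and returns an output. The name `f` is arbitrary and chosen only for this illustration.

```r
# The formula y = 3x + 7 written as an R function
f <- function(x) {
  3 * x + 7
}

f(5)           # 22
f(c(0, 1, 2))  # 7 10 13
```

Because R functions are vectorized here, plugging in several inputs at once returns one output per input.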

Quant Lingo

Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis in the plot
- Dependent variable
- Outcome variable
- Y variable
Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis in the plot
- Independent variables
- Predictors

Linear model with one explanatory variable…

\(Y = a + bX\)
\(Y\) is the outcome variable
\(X\) is the explanatory variable
\(a\) is the intercept: the predicted value of \(Y\) when \(X\) is equal to 0
\(b\) is the slope of the line [remember rise over run!]

Quant Lingo

Predicted value: Output of the model function
- The model function gives the typical (expected) value of the response variable, conditional on the explanatory variables
- We often call this \(\hat{Y}\) to differentiate the predicted value from an observed value of Y in the data
Residuals: A measure of how far each case is from its predicted value (based on a particular model)
- Residual = Observed value (\(Y\)) - Predicted value (\(\hat{Y}\))
- How far above/below the expected value each case is
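
A minimal sketch of how predicted values and residuals are computed. The intercept, slope, and data here are made up purely for illustration; they are not estimates from the V-Dem data.

```r
# Hypothetical coefficients and data, for illustration only
a <- 2
b <- 0.5
x <- c(1, 2, 3, 4)
y <- c(3, 2.5, 4, 4.5)   # observed values of the response

y_hat <- a + b * x       # predicted values: 2.5 3.0 3.5 4.0
residual <- y - y_hat    # observed minus predicted: 0.5 -0.5 0.5 0.5

data.frame(x, y, y_hat, residual)
```

A positive residual means the case sits above its predicted value; a negative residual means it sits below.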

Residuals

Linear Model

\(\hat{Y} = a + b \times X\)

\(\hat{Y} = 0.13 + 0.12 \times X\)

Linear Model: Interpretation

\(\hat{Y} = a + b \times X\)
\(\hat{Y} = 0.13 + 0.12 \times X\)

What is the interpretation of our estimate of \(a\)?

\(\hat{Y} = 0.13 + 0.12 \times 0\)
\(\hat{Y} = 0.13\)

\(a\) is our predicted level of democracy when GDP per capita is 0.

Linear Model: Interpretation

\(\hat{Y} = a + b \times X\)
\(\hat{Y} = 0.13 + 0.12 \times X\)

What is the interpretation of our estimate of \(b\)?

\(\hat{Y} = a + \frac{\text{Rise}}{\text{Run}} \times X\)
\(\hat{Y} = a + \frac{\text{Change } Y}{\text{Change } X} \times X\)

Linear Model: Interpretation

\(b = \frac{\text{Change } Y}{\text{Change } X}\)
\(0.12 = \frac{\text{Change } Y}{\text{Change } X}\)
\(\text{Change } Y = 0.12 \times \text{Change } X\)

When \(\text{Change } X = 1\):
\(\text{Change } Y = 0.12\)

\(b\) is the predicted change in \(Y\) associated with a one-unit change in \(X\).
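
The interpretation of \(a\) and \(b\) can be checked numerically with a small sketch. It uses the estimates from the slides (0.13 and 0.12); the helper name `predict_y` is hypothetical and not part of any package.

```r
# Estimates from the slides: a = 0.13, b = 0.12
predict_y <- function(x, a = 0.13, b = 0.12) {
  a + b * x
}

predict_y(0)                 # the intercept a: prediction when X is 0
predict_y(3) - predict_y(2)  # change in prediction for a one-unit change in X
predict_y(5) - predict_y(3)  # change in prediction for a two-unit change in X
```

The one-unit difference equals \(b\) no matter which starting value of \(X\) is used (up to floating-point rounding), which is what makes the model linear.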

Linear Model: Interpretation

Is this the causal effect of GDP per capita on liberal democracy?

No! It is only the association…

To identify causality we need other methods (beyond the scope of this course).

Your Task

An economist is interested in the relationship between years of education and hourly wages. They estimate a linear model with estimates of \(a\) and \(b\) as follows:

\(\hat{Y} = 9 + 1.60 \times \text{YrsEdu}\)

1. Interpret \(a\) and \(b\)
2. What is the predicted hourly wage for those with 10 years of education?

Next step

Linear model with one predictor: \(Y = a + bX\)
For any given data…
How do we figure out what the best values are for \(a\) and \(b\)?