Linear Regression

April 23, 2024

Linear Model with Single Predictor

Goal: Estimate the democracy score (\(\hat{Y}_{i}\)) of a country given its level of GDP per capita (\(X_{i}\)).

Or: Estimate the relationship between GDP per capita and democracy.

Estimate Model using Tidymodels

Step 1: Specify model

linear_reg()

Step 2: Set model fitting engine

linear_reg() |>
  set_engine("lm") # lm: linear model

Step 3: Fit model & estimate parameters

… using formula syntax

linear_reg() |>
  set_engine("lm") |>
  fit(lib_dem ~ log_wealth, data = modelData)

parsnip model object


Call:
stats::lm(formula = lib_dem ~ log_wealth, data = data)

Coefficients:
(Intercept)   log_wealth  
     0.1327       0.1197

Step 4: Tidy things up…

\[\widehat{\text{Democracy}}_{i} = 0.13 + 0.12 \times \text{loggdppc}_{i}\]

linear_reg() |>
  set_engine("lm") |>
  fit(lib_dem ~ log_wealth, data = modelData) |>
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.133    0.0380      3.49 6.04e- 4
2 log_wealth     0.120    0.0147      8.16 6.97e-14
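
The calls above rely on modelData, which is not defined on these slides. As a minimal sketch, the same four steps can be run end to end on simulated data. The names sim_data and model_fit, the seed, and the coefficients used to simulate are all made up for illustration; this is not the V-Dem data.

```r
library(tidymodels)

# Simulated stand-in for modelData (hypothetical values, not V-Dem)
set.seed(123)
sim_data <- tibble(
  log_wealth = runif(100, min = 0, max = 5),
  lib_dem    = 0.13 + 0.12 * log_wealth + rnorm(100, sd = 0.1)
)

# Steps 1-3: specify the model, set the engine, fit with formula syntax
model_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(lib_dem ~ log_wealth, data = sim_data)

# Step 4: tidy the coefficients into a tibble
coef_table <- tidy(model_fit)
coef_table

# Predicted democracy score for a hypothetical country with log_wealth = 2
predict(model_fit, new_data = tibble(log_wealth = 2))
```

Because the data are simulated with a slope of 0.12, the estimated slope should land close to that value, but the exact numbers depend on the seed.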

Interpretation?

\[\widehat{\text{Democracy}}_{i} = 0.13 + 0.12 \times \text{loggdppc}_{i}\]
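
One way to approach the interpretation is to plug a few values into the fitted line. The sketch below uses the coefficients from the model output (0.1327 and 0.1197); the log_wealth values 1, 2, and 3 are hypothetical and chosen only to show the pattern.

```r
# Coefficients copied from the fitted model output
intercept <- 0.1327
slope     <- 0.1197

# Hypothetical values of the predictor, one unit apart
log_wealth <- c(1, 2, 3)

# Predicted democracy score at each value
predicted <- intercept + slope * log_wealth
predicted

# Each one-unit increase in log_wealth changes the prediction by the slope
diff(predicted)
```

The intercept is the predicted democracy score when log_wealth equals 0, and the slope is the change in the predicted score for each one-unit increase in log_wealth.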

Question

How do we get the “best” values for the slope and intercept?

How would you draw the “best” line?

Least squares regression

Remember: the residual is the difference between the actual value and the predicted value.

The regression line minimizes the sum of squared residuals.

Least squares regression

The residual for each point is: \(e_i = y_i - \hat{y}_i\)
The least squares regression line minimizes \(\sum_{i = 1}^n e_i^2\).
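
To make the two definitions concrete, here is a small sketch in base R. The five data points are hypothetical. It computes the residuals and their sum of squares for the least squares line, then does the same for a line shifted up by 0.5 to show that the shifted line does worse.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares fit
fit   <- lm(y ~ x)
y_hat <- fitted(fit)

# Residuals: actual minus predicted
e <- y - y_hat

# Sum of squared residuals for the least squares line
ssr_fit <- sum(e^2)

# Same slope, intercept shifted up by 0.5: a different candidate line
y_hat_other <- (coef(fit)[1] + 0.5) + coef(fit)[2] * x
ssr_other   <- sum((y - y_hat_other)^2)

ssr_other > ssr_fit
```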

Why do we square the residual?

Why not take the absolute value?
- Principle: squaring puts a larger penalty on residuals that are further from the line
- Math: squaring makes the math easier and has some nice properties (not our concern here…)
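
A quick illustration of the "larger penalty" principle, using two made-up sets of residuals. Both sets have the same total absolute error, but squaring penalizes the set with one big miss much more heavily.

```r
# Two hypothetical sets of residuals
small_misses <- c(1, 1, 1, 1)  # four small errors
one_big_miss <- c(0, 0, 0, 4)  # one large error

# Absolute values treat the two sets the same
abs_small <- sum(abs(small_misses))
abs_big   <- sum(abs(one_big_miss))

# Squares penalize the large error more
sq_small <- sum(small_misses^2)
sq_big   <- sum(one_big_miss^2)

c(abs_small = abs_small, abs_big = abs_big, sq_small = sq_small, sq_big = sq_big)
```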

Very Simple Example

What should the slope and intercept be?

Example

\(\hat{Y} = 0 + 1*X\)

Example

What is the sum of squared residuals?

Example

What is the sum of squared residuals for \(\hat{Y} = 0 + 0*X\)?

(1-0)^2 + (2-0)^2 + (3-0)^2

[1] 14

Example

What is the sum of squared residuals for \(\hat{Y} = 0 + 2*X\)?

(1-2)^2 + (2-4)^2 + (3-6)^2

[1] 14

One more…

What is the sum of squared residuals for \(\hat{Y} = 0 + -1*X\)?

(1+1)^2 + (2+2)^2 + (3+3)^2

[1] 56

Cost Function

Sum of squared residuals as a function of possible values of the slope \(b\)
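
The cost function can be traced out by hand. This sketch assumes the three points in the simple example are (1, 1), (2, 2), and (3, 3), which is what the arithmetic on the earlier slides implies, and keeps the intercept fixed at 0. The helper ssr_at() and the grid of slopes are made up for illustration.

```r
# Assumed data for the simple example
x <- c(1, 2, 3)
y <- c(1, 2, 3)

# Sum of squared residuals for the line y-hat = 0 + b * x
ssr_at <- function(b) sum((y - b * x)^2)

# Evaluate the cost over a grid of candidate slopes
b_grid <- seq(-1, 3, by = 0.5)
ssr    <- sapply(b_grid, ssr_at)

data.frame(b = b_grid, ssr = ssr)

# The slope with the lowest cost on this grid
best_b <- b_grid[which.min(ssr)]
best_b
```

Plotting ssr against b_grid, for example with plot(b_grid, ssr, type = "b"), gives a U-shaped curve with its lowest point at the best slope.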

Least Squares Regression

When we estimate a least squares regression, we are looking for the line that minimizes the sum of squared residuals.
In the simple example, I set the intercept \(a = 0\) to make it easier. It is more complicated to search for the combination of intercept \(a\) and slope \(b\) that minimizes the sum of squared residuals, but the basic idea is the same.
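
As a rough sketch of what searching over combinations of \(a\) and \(b\) could look like, the code below tries every pair on a grid and keeps the one with the lowest sum of squared residuals. The data and grid ranges are hypothetical, and a brute-force grid is used only to show the idea; it is not how lm() finds its answer.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Every combination of candidate intercepts (a) and slopes (b)
grid <- expand.grid(
  a = seq(-1, 1, by = 0.05),
  b = seq(0, 4, by = 0.01)
)

# Sum of squared residuals for each (a, b) pair
grid$ssr <- mapply(
  function(a, b) sum((y - (a + b * x))^2),
  grid$a, grid$b
)

# The pair with the lowest cost on the grid
best <- grid[which.min(grid$ssr), ]
best

# Compare with the least squares estimates
coef(lm(y ~ x))
```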

Least Squares Regression

There is a way to solve for this analytically for linear regression (i.e., by doing math…)

– They made us do this in grad school…
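
For a single predictor, the analytic solution is short enough to compute directly: the slope is the sum of cross-products of deviations from the means divided by the sum of squared deviations of \(x\), and the intercept makes the line pass through the point of means. A sketch on hypothetical data, checked against lm():

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for one predictor
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)

# Same answer as lm()
coef(lm(y ~ x))
```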

In machine learning, people also use the gradient descent algorithm, in which the computer starts from a guess for \(a\) and \(b\) and repeatedly adjusts them in the direction that lowers the sum of squared residuals until it settles on the lowest point.
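
A bare-bones sketch of gradient descent for a one-predictor regression, on hypothetical data. The starting values, learning rate, and number of steps are arbitrary choices that happen to work for this small example; they would need tuning for other data.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Starting guess and step size (assumed values)
a <- 0
b <- 0
learning_rate <- 0.01

for (step in 1:5000) {
  residual <- y - (a + b * x)

  # Gradient of the sum of squared residuals with respect to a and b
  grad_a <- -2 * sum(residual)
  grad_b <- -2 * sum(residual * x)

  # Move a little in the downhill direction
  a <- a - learning_rate * grad_a
  b <- b - learning_rate * grad_b
}

c(intercept = a, slope = b)

# Compare with the least squares estimates
coef(lm(y ~ x))
```

With a learning rate that is too large the updates overshoot and the estimates blow up, which is one reason the analytic solution is preferred when it is available.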

Your Turn

Are democracies less corrupt?

V-Dem includes a Political Corruption Index, which aggregates corruption in a number of spheres (see the codebook for details).
The variable name is v2x_corr; lower values mean less corruption.
See starter code HERE

Filter the V-Dem data to only include the year 2019
Make a scatterplot to visualize the relationship between democracy (X) and corruption (Y) (use the v2x_libdem variable for democracy)
Fit a linear model
Interpret results for the slope and intercept
For a country with the average (mean) level of democracy, what is the predicted level of corruption?

Create Your Own Model

What is a theory that you would like to test with V-Dem data?
What is the dependent variable?
What is the independent variable?
Map out steps to wrangle the data and fit a regression model
What do you expect to find?
Now go ahead and wrangle the data
Fit the model
Interpret the coefficients and their significance
Did the results match your expectations?