Linear Regression

April 23, 2024

Linear Model with Single Predictor


Goal: Estimate the democracy score (\(\hat{Y}_{i}\)) of a country given its level of GDP per capita (\(X_{i}\)).


Or: Estimate relationship between GDP per capita and democracy.

Estimate Model using Tidymodels


Step 1: Specify model


linear_reg()


Step 2: Set model fitting engine


linear_reg() |>
  set_engine("lm") # lm: linear model


Step 3: Fit model & estimate parameters

… using formula syntax

linear_reg() |>
  set_engine("lm") |>
  fit(lib_dem ~ log_wealth, data = modelData) 
parsnip model object


Call:
stats::lm(formula = lib_dem ~ log_wealth, data = data)

Coefficients:
(Intercept)   log_wealth  
     0.1327       0.1197  


Step 4: Tidy things up…


\[\widehat{\text{Democracy}}_{i} = 0.13 + 0.12 \times \text{loggdppc}_{i}\]

linear_reg() |>
  set_engine("lm") |>
  fit(lib_dem ~ log_wealth, data = modelData) |>
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.133    0.0380      3.49 6.04e- 4
2 log_wealth     0.120    0.0147      8.16 6.97e-14

Interpretation?


\[\widehat{\text{Democracy}}_{i} = 0.13 + 0.12 \times \text{loggdppc}_{i}\]
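
One way to read the fitted equation is to plug values into it. The sketch below hard-codes the two coefficients from the model output above. `predict_democracy()` is a hypothetical helper name, not a tidymodels function. The base of the logarithm behind `log_wealth` is not stated here, so "one unit" means one unit on whatever log scale the data use.

```r
# Coefficients copied from the fitted model output above
intercept <- 0.1327
slope     <- 0.1197

# Hypothetical helper: predicted democracy score for a given log_wealth
predict_democracy <- function(log_wealth) {
  intercept + slope * log_wealth
}

# Intercept: predicted score when log_wealth equals 0
at_zero <- predict_democracy(0)

# Slope: change in the predicted score for a one-unit increase in log_wealth
one_unit_change <- predict_democracy(3) - predict_democracy(2)

print(c(at_zero = at_zero, one_unit_change = one_unit_change))
```

The difference between any two predictions one unit apart equals the slope, which is why the slope is read as the expected change in the democracy score per unit of `log_wealth`.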

Question


How do we get the “best” values for the slope and intercept?

How would you draw the “best” line?

Least squares regression


  • Remember that the residual is the difference between the actual value and the predicted value.
  • The regression line minimizes the sum of squared residuals.

Least squares regression


  • Residual for each point is: \(e_i = y_i - \hat{y}_i\)

  • Least squares regression line minimizes \(\sum_{i = 1}^n e_i^2\).

  • Why do we square the residual?
  • Why not take absolute value?

    • Principle: squaring puts a larger penalty on residuals that are further from the line
    • Math: squaring makes the math easier and gives the estimates some nice properties (not our concern here…)
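
To make the penalty idea concrete, here is a minimal sketch that compares squared and absolute residuals. The function names `ssr()` and `sar()` and the four data points are hypothetical, chosen so that one point sits far from the line.

```r
# Sum of squared residuals for the line y_hat = a + b * x
ssr <- function(a, b, x, y) {
  y_hat <- a + b * x      # predicted values
  e <- y - y_hat          # residuals: actual minus predicted
  sum(e^2)
}

# Sum of absolute residuals, for comparison
sar <- function(a, b, x, y) {
  sum(abs(y - (a + b * x)))
}

# Hypothetical data: the last point is far from the line y_hat = 0 + 1 * x
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3, 8)

ssr(0, 1, x, y)  # the residual of 4 contributes 4^2 = 16
sar(0, 1, x, y)  # the same residual contributes only 4
```

Under squaring, a residual twice as large counts four times as much, so the fitted line is pulled harder toward distant points.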

Very Simple Example

What should the slope and intercept be?

Example

\(\hat{Y} = 0 + 1*X\)

Example

What is the sum of squared residuals?

Example

What is the sum of squared residuals for \(\hat{Y} = 0 + 0*X\)?

(1-0)^2 + (2-0)^2 + (3-0)^2
[1] 14

Example

What is the sum of squared residuals for \(\hat{Y} = 0 + 2*X\)?

(1-2)^2 + (2-4)^2 + (3-6)^2
[1] 14

One more…

What is the sum of squared residuals for \(\hat{Y} = 0 + -1*X\)?

(1+1)^2 + (2+2)^2 + (3+3)^2
[1] 56
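
The hand calculations above can be checked in one pass. This sketch assumes the data are the three points (1, 1), (2, 2), (3, 3), which is what the arithmetic implies. `ssr()` is a hypothetical helper name.

```r
# Data implied by the arithmetic above: the points (1, 1), (2, 2), (3, 3)
x <- c(1, 2, 3)
y <- c(1, 2, 3)

# Sum of squared residuals for the line y_hat = a + b * x
ssr <- function(a, b) sum((y - (a + b * x))^2)

# Try each candidate slope with the intercept fixed at 0
slopes <- c(0, 1, 2, -1)
results <- sapply(slopes, function(b) ssr(a = 0, b = b))
names(results) <- paste0("b = ", slopes)
print(results)
```

The slope of 1 gives a sum of 0 because that line passes through all three points. The slopes 0 and 2 miss by the same amount in opposite directions, so both give 14.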

Cost Function

Sum of squared residuals as a function of possible values of the slope \(b\)
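
A rough way to trace this curve is to evaluate the sum of squared residuals over a grid of candidate slopes. The sketch below again assumes the three points (1, 1), (2, 2), (3, 3) with the intercept fixed at 0. The grid range and step size are arbitrary choices for illustration.

```r
# Data implied by the simple example: (1, 1), (2, 2), (3, 3)
x <- c(1, 2, 3)
y <- c(1, 2, 3)

# Cost of a candidate slope b, with the intercept fixed at 0
ssr_for_slope <- function(b) sum((y - b * x)^2)

# Evaluate the cost over a grid of candidate slopes
b_grid <- seq(-1, 3, by = 0.25)
cost <- sapply(b_grid, ssr_for_slope)

# The slope with the lowest cost on the grid
best_b <- b_grid[which.min(cost)]
print(best_b)

# plot(b_grid, cost, type = "l")  # uncomment to draw the curve
```

The cost curve is a parabola in \(b\), so it has a single lowest point.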

Least Squares Regression


  • When we estimate a least squares regression, we are looking for the line that minimizes the sum of squared residuals

  • In the simple example, I set the intercept \(a=0\) to make things easier. It is more complicated to search for the combination of \(a\) and slope \(b\) that minimizes the sum, but the basic idea is the same

Least Squares Regression


  • There is a way to solve for this analytically for linear regression (i.e., by doing math…)

    – They made us do this in grad school…

  • In machine learning, people also use the gradient descent algorithm, in which the computer moves step by step through possible combinations of \(a\) and \(b\) until it settles on the lowest point of the cost function.
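
Both routes can be sketched side by side. The data, starting values, learning rate, and number of steps below are all assumptions made for illustration. The closed-form expressions are the standard single-predictor formulas, and the loop is a bare-bones gradient descent, not what `lm()` does internally.

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Analytic solution for a single predictor
b_exact <- cov(x, y) / var(x)            # slope
a_exact <- mean(y) - b_exact * mean(x)   # intercept

# Gradient descent on the sum of squared residuals
a <- 0
b <- 0
learning_rate <- 0.01   # assumed step size; a much larger value would diverge
for (step in 1:5000) {
  e <- y - (a + b * x)          # residuals at the current guess
  grad_a <- -2 * sum(e)         # derivative of the cost with respect to a
  grad_b <- -2 * sum(e * x)     # derivative of the cost with respect to b
  a <- a - learning_rate * grad_a
  b <- b - learning_rate * grad_b
}

print(c(a_exact = a_exact, a_descent = a, b_exact = b_exact, b_descent = b))
print(coef(lm(y ~ x)))  # base R's lm() for comparison
```

With enough steps the two approaches should agree to many decimal places. For linear regression the analytic route is exact and faster, which is why gradient descent matters mainly for models without a closed-form solution.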

Your Turn


Are democracies less corrupt?


  • V-Dem includes a Political Corruption Index, which aggregates corruption in a number of spheres (see codebook for details).

  • The variable name is v2x_corr; lower values mean less corruption

  • See starter code HERE

Your Turn


Are democracies less corrupt?


  • Filter the V-Dem data to only include the year 2019
  • Make a scatterplot to visualize the relationship between democracy (X) and corruption (Y) (use the v2x_libdem variable for democracy)
  • Fit a linear model
  • Interpret results for the slope and intercept
  • For a country with the average (mean) level of democracy, what is the predicted level of corruption?
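
For the last step, it may help to see how a prediction at the mean works on a small made-up data set. Everything in the sketch below is a stand-in: the `toy` data frame and its column names are hypothetical, and it calls base R's `lm()` directly (the function the "lm" engine uses) to stay self-contained.

```r
# Hypothetical stand-in data; the exercise itself uses V-Dem data for 2019
toy <- data.frame(
  democracy  = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  corruption = c(0.90, 0.70, 0.65, 0.40, 0.30, 0.10)
)

# Fit with base R's lm(), the function the "lm" engine calls
fit <- lm(corruption ~ democracy, data = toy)

# Predicted corruption at the mean level of democracy
at_mean <- data.frame(democracy = mean(toy$democracy))
pred_at_mean <- unname(predict(fit, newdata = at_mean))

# A least squares line with an intercept passes through (mean x, mean y)
print(c(prediction = pred_at_mean, mean_corruption = mean(toy$corruption)))
```

The same check applies to the real data: the prediction at the mean of the predictor should match the mean of the outcome.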

Create Your Own Model


  • What is a theory that you would like to test with V-Dem data?
  • What is the dependent variable?
  • What is the independent variable?
  • Map out steps to wrangle the data and fit a regression model
  • What do you expect to find?
  • Now go ahead and wrangle the data
  • Fit the model
  • Interpret the coefficients and their significance
  • Did the results match your expectations?