Logistic Regression

Classification

April 23, 2024

Binary Outcomes

Binary Outcomes

So far we have looked at continuous or numerical outcomes (response variables)
We are often also interested in outcome variables that are binary (Yes/No, or 1/0)
- Did violence happen, or not?
- Classification: is this email spam?

Example: Conflict Onset

Did a civil war begin in a given country in a given year? (yes/no)
Predictors: wealth, democracy, terrain, ethnic diversity, etc.
Seminal work by Fearon and Laitin (2003)
We can use logistic regression to model this binary outcome

Modeling

We can treat each observation's outcome (conflict onset or not) as a success or failure arising from a separate Bernoulli trial
Bernoulli trial: a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted
Success is usually coded as 1, failure as 0
So ironically, conflict onset is a “success” in this context

Modeling

Each Bernoulli trial can have a separate probability of success

\[ y_i \sim \text{Bern}(p_i) \]
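
A minimal sketch of this idea in R, using base R's `rbinom()` with `size = 1` to draw one Bernoulli outcome per observation. The probabilities in `p` are made-up illustrative values, not estimates from the conflict data.

```r
# Hypothetical success probabilities for five country-years (illustrative only)
p <- c(0.01, 0.05, 0.10, 0.50, 0.90)

set.seed(123)  # make the random draws reproducible

# One Bernoulli trial per observation: size = 1 gives outcomes of 0 or 1
y <- rbinom(n = length(p), size = 1, prob = p)
y
```

Each element of `y` is a 0 or 1 drawn with its own probability of success.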

Modeling

We can then use the predictor variables to model that probability of success, $p_i$
We can’t really use a linear model for $p_i$ (since $p_i$ must be between 0 and 1) but we can transform the linear model to have the appropriate range
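
To see the range problem concretely, here is a small sketch with made-up coefficients (they are assumptions for illustration, not fitted values). A straight line in `x` leaves the unit interval, while one common transformation, base R's `plogis()` (the inverse logit used later in these slides), keeps every value strictly between 0 and 1.

```r
# Illustrative predictor values and made-up coefficients
x <- seq(-4, 4, by = 1)
eta <- 0.5 + 0.2 * x        # a straight line in x

range(eta)                  # runs from -0.3 to 1.3, so not a valid probability

p <- plogis(eta)            # inverse logit maps any real number into (0, 1)
range(p)
```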

Generalized Linear Models

This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
Logistic regression is a very common example

GLMs

All GLMs have the following three characteristics:

A probability distribution describing a generative model for the outcome variable
A linear model: \[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
A link function that relates the linear model to the parameter of the outcome distribution
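
In base R, the distribution and the link function travel together in a family object. The sketch below inspects `binomial(link = "logit")` from the `stats` package; the value 0.25 is an arbitrary probability chosen for illustration.

```r
# A family object bundles the outcome distribution with its link function
fam <- binomial(link = "logit")

fam$family          # name of the distribution
fam$link            # name of the link function

# linkfun maps the distribution's parameter (p) to the linear predictor (eta)
eta <- fam$linkfun(0.25)    # same as log(0.25 / 0.75)

# linkinv maps the linear predictor back to the parameter
fam$linkinv(eta)
```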

Logistic Regression

Logistic regression is a GLM used to model a binary categorical outcome (0 or 1)
In logistic regression, the link function that connects $\eta_i$ to $p_i$ is the logit function
Logit function: For $0 < p < 1$

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
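
As a quick check of the definition, the logit can be written by hand in R and compared with base R's `qlogis()`. The helper name `logit` and the example probabilities are illustrative choices.

```r
# Logit: the log of the odds p / (1 - p)
logit <- function(p) log(p / (1 - p))

p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
logit(p)        # negative below 0.5, zero at 0.5, positive above 0.5

# Base R's qlogis() computes the same quantity
all.equal(logit(p), qlogis(p))
```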

Logit Function

Logistic Regression Model

$y_i \sim \text{Bern}(p_i)$
$\eta_i = \beta_0+ \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$
$\text{logit}(p_i) = \eta_i$

Logistic Regression Model

$\text{logit}(p_i) = \eta_i = \beta_0+ \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$
Now take the inverse logit of both sides to get $p_i$

\[p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}\]
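
The inverse logit can be sketched in R for a single predictor. The coefficients `beta0` and `beta1` and the values of `x` are made up for illustration, and `inv_logit` is a hypothetical helper name.

```r
# Inverse logit: turn a linear predictor eta into a probability
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

# Made-up coefficients and predictor values (illustrative only)
beta0 <- -1
beta1 <- 0.5
x <- c(-2, 0, 2, 4)

eta <- beta0 + beta1 * x    # linear predictor: -2, -1, 0, 1
p <- inv_logit(eta)
p                           # eta = 0 corresponds to p = 0.5

# Base R's plogis() is the same function
all.equal(p, plogis(eta))
```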

Analyzing Conflict Onset

The `peacesciencer` Package

The peacesciencer package provides a number of datasets and functions for analyzing conflict and peace
Provides data from a number of important datasets in the field of conflict studies, e.g.
- Correlates of War (CoW) project
- Uppsala Conflict Data Program (UCDP)
- Militarized Interstate Dispute (MID) dataset
Provides functions for analyzing conflict and adding control variables to the dataset

Using the `peacesciencer` Package

library(peacesciencer)
library(tidymodels)

conflict_df <- create_stateyears(system = 'gw') |>
  filter(year %in% c(1946:1999)) |>
  add_ucdp_acd(type=c("intrastate"), only_wars = FALSE) |>
  add_democracy() |>
  add_creg_fractionalization() |>
  add_sdp_gdp() |>
  add_rugged_terrain()

glimpse(conflict_df)

Rows: 7,036
Columns: 20
$ gwcode         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ statename      <chr> "United States of America", "United States of America",…
$ year           <dbl> 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1…
$ ucdpongoing    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ucdponset      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ maxintensity   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ conflict_ids   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ v2x_polyarchy  <dbl> 0.605, 0.587, 0.599, 0.599, 0.587, 0.602, 0.601, 0.594,…
$ polity2        <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ xm_qudsest     <dbl> 1.259180, 1.259180, 1.252190, 1.252190, 1.270106, 1.259…
$ ethfrac        <dbl> 0.2226323, 0.2248701, 0.2271561, 0.2294918, 0.2318781, …
$ ethpol         <dbl> 0.4152487, 0.4186156, 0.4220368, 0.4255134, 0.4290458, …
$ relfrac        <dbl> 0.4980802, 0.5009111, 0.5037278, 0.5065309, 0.5093204, …
$ relpol         <dbl> 0.7769888, 0.7770017, 0.7770303, 0.7770729, 0.7771274, …
$ wbgdp2011est   <dbl> 28.539, 28.519, 28.545, 28.534, 28.572, 28.635, 28.669,…
$ wbpopest       <dbl> 18.744, 18.756, 18.781, 18.804, 18.821, 18.832, 18.848,…
$ sdpest         <dbl> 28.478, 28.456, 28.483, 28.469, 28.510, 28.576, 28.611,…
$ wbgdppc2011est <dbl> 9.794, 9.762, 9.764, 9.730, 9.752, 9.803, 9.821, 9.857,…
$ rugged         <dbl> 1.073, 1.073, 1.073, 1.073, 1.073, 1.073, 1.073, 1.073,…
$ newlmtnest     <dbl> 3.214868, 3.214868, 3.214868, 3.214868, 3.214868, 3.214…

Running a Logistic Regression

Implementation is not very different from a linear model
We just need to update our code to run a GLM
- specify the model with logistic_reg()
- use "glm" instead of "lm" as the engine
- the outcome distribution is family = "binomial", whose default link is the logit (logistic_reg() with the "glm" engine already uses this family by default)
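
For comparison, here is a minimal base R sketch of the same kind of model fitted directly with `glm()`. It uses simulated data rather than the conflict data, and the true coefficients (-1 and 0.8), the sample size, and the name `sim_df` are all assumptions chosen for illustration.

```r
set.seed(42)

# Simulate a predictor and a binary outcome from known (made-up) coefficients
n <- 5000
x <- rnorm(n)
p <- plogis(-1 + 0.8 * x)
y <- rbinom(n, size = 1, prob = p)
sim_df <- data.frame(x = x, y = y)

# Same formula interface as lm(), plus a family argument
sim_fit <- glm(y ~ x, data = sim_df, family = binomial)

coef(sim_fit)   # estimates should land near the true values -1 and 0.8
```

Because the data were generated from a logistic model, the fitted coefficients should be close to the values used in the simulation.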

Bivariate Logistic Regression

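# Specify a logistic regression, use glm() as the engine, and fit it
conflict_model <- logistic_reg() |>
  set_engine("glm") |>
  fit(factor(ucdponset) ~ wbgdppc2011est,  # the outcome must be a factor
      data = conflict_df,
      family = "binomial")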

tidy(conflict_model)

# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      -1.16     0.426      -2.73 6.37e- 3
2 wbgdppc2011est   -0.331    0.0526     -6.29 3.20e-10

Interpreting the Results

\[\log\left(\frac{p}{1-p}\right) = -1.16-0.33\times \text{logGDPpc}\]

Interpreting the Results

For a quick interpretation of the coefficients, we can exponentiate them
The exponentiated coefficient is the odds ratio
For each one-unit increase in the independent variable, the odds of the outcome occurring increase (or decrease) by a factor of the exponentiated coefficient

Interpreting the Results

\[\log\left(\frac{p}{1-p}\right) = -1.16-0.33\times \text{logGDPpc}\]

For each one-unit increase in log GDP per capita, the odds of conflict onset are multiplied by approximately $\exp(-0.331) \approx 0.718$.

This means that higher GDP per capita is associated with lower odds of conflict onset: the odds decrease by about 28.2% for each one-unit increase in log GDP per capita.
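
The arithmetic behind these numbers can be reproduced in a couple of lines of R. The slope is typed in by hand from the `tidy()` output above; the variable names are illustrative.

```r
# Slope on log GDP per capita, as reported in the tidy() output
b1 <- -0.331

odds_ratio <- exp(b1)
odds_ratio                  # about 0.718

# Percentage change in the odds per one-unit increase
(odds_ratio - 1) * 100      # about -28.2
```

With the fitted model from earlier, `tidy(conflict_model, exponentiate = TRUE)` should report the odds ratio directly, assuming the installed version of `broom` supports that argument for `glm` fits.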

Your Turn!

Run a bivariate logistic regression using ucdponset as the outcome variable
First replicate the results using GDP per capita as the predictor
Now try a different predictor
Interpret the results
- What is the average effect of the predictor on conflict onset?
- How do you interpret that effect in terms of the odds of conflict onset?

Calculating Predicted Probabilities

Probability of conflict onset for a country with a log per capita GDP of 9 (about $8,000):

\[\log\left(\frac{p}{1-p}\right) = -1.16-0.33\times 9\] \[\log\left(\frac{p}{1-p}\right) = -4.13\]

\[\frac{p}{1-p} = \exp(-4.13)\]

\[\frac{p}{1-p} = 0.016\]

\[p = 0.016 \times (1 - p)\] \[p = 0.016 - 0.016p\]

\[1.016p = 0.016\] \[p = 0.016 / 1.016\] \[p = 0.0158\]
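
The same hand calculation can be checked in R. The coefficients are the rounded values from the slides, typed in by hand, and the variable names are illustrative.

```r
# Rounded coefficients from the fitted model
b0 <- -1.16
b1 <- -0.33

log_odds <- b0 + b1 * 9     # -4.13
odds <- exp(log_odds)       # about 0.016
p <- odds / (1 + odds)      # about 0.0158
p

# plogis() goes from log odds to probability in one step
plogis(log_odds)
```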

Using `marginaleffects`

# load the marginaleffects library
library(marginaleffects)

# select some countries for a given year
selected_countries <- conflict_df |>
  filter(
    statename %in% c("United States of America", "Venezuela", "Rwanda"),
    year == 1999)

# extract the model
conflict_fit <- conflict_model$fit

# calculate margins for the subset
marg_effects <- predictions(conflict_fit, newdata = selected_countries)

# tidy the results
tidy(marg_effects) |>
  select(estimate, p.value, conf.low, conf.high, statename)

Using `marginaleffects`

# A tibble: 3 × 5
  estimate   p.value conf.low conf.high statename               
     <dbl>     <dbl>    <dbl>     <dbl> <chr>                   
1  0.00853 4.20e-161  0.00606    0.0120 United States of America
2  0.0141  0          0.0113     0.0175 Venezuela               
3  0.0311  1.36e-250  0.0256     0.0377 Rwanda

Your Turn!

Select your favorite three countries and a recent year
Calculate the predicted probability of conflict onset for that year using the `marginaleffects` package
If you have time, try to do the calculation by hand as well
