IAFF 6501–Spring 2024 Schedule

Overview

A very common use of tables in the social sciences is to present regression results. There are numerous packages available for presenting regression output. In this lesson, we are going to focus on one of them that I think is particularly good for both pdf and html output: modelsummary. modelsummary also includes a function (modelplot()) for plotting point estimates and confidence intervals. So part of the objective of this lesson is going to be to learn when you should use a table to present your regression results versus when you should use a plot.

We will be taking our example from the peace studies literature. We are going to download data using the peacesciencer package and use it to partially reproduce a famous analysis of conflict onset by Fearon and Laitin.

Run a regression model and display results with broom

We will start off by building a data frame for our analysis with the peacesciencer package. peacesciencer is designed to make standard analysis for conflict studies more convenient and includes many of the control variables that you would use to estimate the likelihood of conflict onset or duration.

To start, we call create_stateyears() to create a time-series data set for all available countries. We will specify system = 'gw' to denote the Gleditsch-Ward country coding system. Then we will filter for roughly the same years of Fearon and Laitin’s original analysis (1945-99) with the caveat that peacesciencer only has conflict data starting in 1946.

Then we add a bunch of data to the data frame using the various add_X functions available in the package. Here we add UCDP conflict data which includes our dependent variable for this analysis–conflict onset. Then we add measures of democracy, ethno-religious fractionalization, GDP and terrain.

library(peacesciencer)
library(dplyr)

conflict_df <- create_stateyears(system = 'gw') |>
  filter(year %in% c(1946:1999)) |>
  add_ucdp_acd(type=c("intrastate"), only_wars = FALSE) |>
  add_democracy() |>
  add_creg_fractionalization() |>
  add_sdp_gdp() |>
  add_rugged_terrain()

glimpse(conflict_df)

Rows: 7,036
Columns: 20
$ gwcode         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ statename      <chr> "United States of America", "United States of America",…
$ year           <dbl> 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1…
$ ucdpongoing    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ucdponset      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ maxintensity   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ conflict_ids   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ v2x_polyarchy  <dbl> 0.605, 0.587, 0.599, 0.599, 0.587, 0.602, 0.601, 0.594,…
$ polity2        <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ xm_qudsest     <dbl> 1.259180, 1.259180, 1.252190, 1.252190, 1.270106, 1.259…
$ ethfrac        <dbl> 0.2226323, 0.2248701, 0.2271561, 0.2294918, 0.2318781, …
$ ethpol         <dbl> 0.4152487, 0.4186156, 0.4220368, 0.4255134, 0.4290458, …
$ relfrac        <dbl> 0.4980802, 0.5009111, 0.5037278, 0.5065309, 0.5093204, …
$ relpol         <dbl> 0.7769888, 0.7770017, 0.7770303, 0.7770729, 0.7771274, …
$ wbgdp2011est   <dbl> 28.539, 28.519, 28.545, 28.534, 28.572, 28.635, 28.669,…
$ wbpopest       <dbl> 18.744, 18.756, 18.781, 18.804, 18.821, 18.832, 18.848,…
$ sdpest         <dbl> 28.478, 28.456, 28.483, 28.469, 28.510, 28.576, 28.611,…
$ wbgdppc2011est <dbl> 9.794, 9.762, 9.764, 9.730, 9.752, 9.803, 9.821, 9.857,…
$ rugged         <dbl> 1.073, 1.073, 1.073, 1.073, 1.073, 1.073, 1.073, 1.073,…
$ newlmtnest     <dbl> 3.214868, 3.214868, 3.214868, 3.214868, 3.214868, 3.214…

Now let’s go ahead and run the analysis. We will specify a logit model using the glm() function and specifying family = binomial(link = "logit"). We will store our model in an object called conflict_model. And from there we can use the tidy() function from the broom package to view the results.

Before doing this, though, try calling summary() from base R on the model. This provides us with a basic regression table and it is great insofar as we don’t want to do anything else with these estimates. Next, go ahead and call View() on the model object of just click on it to see what it looks like. You will notice that the results are stored in a complicated list format.

The tidy() function enables us to take the results from this list and store them in a “tibble”, which is the Tidyverse equivalent of a data frame. Once the results are stored like this, we can easily access the estimates for anything that we might want to do with them including combining the results of different models or displaying particular estimates in our document using all of the tools that we have learned in this course. We can also set conf.int = TRUE) as an argument in tidy() to create and store confidence intervals.

By default, tidy() returns p-values with large numbers of digits following the decimal point, making hard to tell whether the variables are significant. To fix this, we can pipe our tidy() output into a mutate_if() call and specify that we want numeric output to round to five decimal places.

library(broom)

conflict_model <- glm(ucdponset ~ ethfrac + relfrac + v2x_polyarchy + 
                        rugged + wbgdppc2011est + wbpopest,
                  data= conflict_df,
                  family = binomial(link="logit"))

# summary(conflict_model)

tidy_model <- conflict_model |>
  tidy(conf.int = TRUE) |>
  mutate_if(is.numeric, round, 5)

tidy_model

# A tibble: 7 × 7
  term           estimate std.error statistic p.value conf.low conf.high
  <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)     -5.69      1.41      -4.04  0.00005  -8.44      -2.92 
2 ethfrac          0.800     0.381      2.10  0.0356    0.0670     1.56 
3 relfrac         -0.391     0.417     -0.939 0.348    -1.22       0.420
4 v2x_polyarchy   -0.602     0.509     -1.18  0.237    -1.61       0.387
5 rugged           0.0641    0.0760     0.843 0.399    -0.0923     0.207
6 wbgdppc2011est  -0.372     0.121     -3.08  0.00204  -0.613     -0.140
7 wbpopest         0.293     0.0673     4.35  0.00001   0.162      0.426

How did we do relative to Fearon and Latin’s original analysis. Well, one thing that F&L were pretty certain about is that ethnic and religious fractionalization do not matter for conflict onset. But here we find a statistically significant relationship between these variables and conflict onset. But one thing we do find in common with their analysis is the importance of wealth and population. Both of these variables are significant in the expected direction. Wealthier countries experience less risk of conflict onset while more populous ones have a higher risk.

Run many regressions and display with modelsummary

broom is really great if we want to just run one regression and see the results in the context of a working document. But what if we want to display our results to other researchers? For this, we need to use a different package. One package that is really good at producing professional-looking tabels is modelsummary and one of its strongest features is the ability to combine multiple models into the same table while still allowing for substantial customization.

Let’s go ahead and store four models. In the first three, we will feature sets of predictors (ethnicity, democracy and terrain) and then a final model that includes all of our predictors. In each model, we will include wealth (GDP) and population as controls because we have a feeling that these are robust predictors of conflict onset.

ethnicity <- glm(ucdponset ~ ethfrac + relfrac + wbgdppc2011est + wbpopest, # store each model in an object
                  data = conflict_df,
                  family = binomial(link="logit"))

democracy <- glm(ucdponset ~ v2x_polyarchy + wbgdppc2011est +  wbpopest,
                  data = conflict_df,
                  family = binomial(link="logit"))

terrain <- glm(ucdponset ~ rugged + wbgdppc2011est + wbpopest ,
                  data = conflict_df,
                  family = binomial(link="logit"))

full_model <- glm(ucdponset ~ ethfrac + relfrac + v2x_polyarchy + rugged +
                        wbgdppc2011est + wbpopest,
                  data = conflict_df,
                  family = binomial(link="logit"))

Next we will store our models as a list with intuitive names that we can display as column headers in our regression table. Then we will store a new coefficient mapping where we rename our variables and change the order that they will appear in the table. We will also store a title and a reference.

models <- list("Ethnicity" = ethnicity,  # store list of models in an object
               "Democracy" = democracy, 
               "Terrain" = terrain, 
               "Full Model" = full_model)

coef_map <- c("ethfrac" = "Ethnic Frac",  # map coefficients
        "relfrac" = "Religions Frac",     #(change names and order)
        "v2x_polyarchy" = "Polyarchy",
        "rugged" = "Terrain",
        "wbgdppc2011est" = "Per capita GDP",
        "wbpopest" = "Population",
        "(Intercept)" = "Intercept")

caption = "Table 1: Predictors of Conflict Onset" # store caption
reference = "See appendix for data sources."      # store reference notes

Now we can call modelsummary(). The first argument is the list of models we want to display. Next we tell modelsummary that we want it to show stars for statistical significance (stars = TRUE). And in gof_map, we say that, of the many goodness of fit statistics available to us, we only want to include the number of observations. Finally we plug our title into title = and a source note into notes =.

library(modelsummary)

modelsummary(models,                      # display the table
             stars = TRUE,                # include stars for significance
             gof_map = c("nobs"),         # goodness of fit stats to include   
             coef_map = coef_map,         # coefficient mapping
             title = caption,             # title
             notes = reference)           # source note

tinytable_dh3egi7tvyg9tcfa3f7p

Table 1: Predictors of Conflict Onset
	Ethnicity	Democracy	Terrain	Full Model
+ p
See appendix for data sources.
Ethnic Frac	0.744*			0.800*
	(0.367)			(0.381)
Religions Frac	-0.481			-0.391
	(0.411)			(0.417)
Polyarchy		-0.228		-0.602
		(0.436)		(0.509)
Terrain			0.031	0.064
			(0.076)	(0.076)
Per capita GDP	-0.474***	-0.512***	-0.543***	-0.372**
	(0.104)	(0.108)	(0.092)	(0.121)
Population	0.282***	0.297***	0.299***	0.293***
	(0.067)	(0.051)	(0.050)	(0.067)
Intercept	-4.703***	-4.407***	-4.296***	-5.693***
	(1.327)	(1.205)	(1.143)	(1.408)
Num.Obs.	6364	6772	6840	6151

When a coefficient plot is better

A regression table is great when we have many models that we want to display. But what happens when we have just one model? We could present something like our earlier tidy() output which has the beta coefficient, the standard error, t-statistic and p-values in separate columns, but this would be unconventional and would take up a lot of space. Another option is just to present a table with one column like this:

modelsummary(conflict_model, 
             stars = TRUE,  
             gof_map = c("nobs"),
             coef_map = coef_map,
             title = caption, 
             notes = reference)

tinytable_g82qwe7d3tc4o4di1037

Table 1: Predictors of Conflict Onset
	(1)
+ p
See appendix for data sources.
Ethnic Frac	0.800*
	(0.381)
Religions Frac	-0.391
	(0.417)
Polyarchy	-0.602
	(0.509)
Terrain	0.064
	(0.076)
Per capita GDP	-0.372**
	(0.121)
Population	0.293***
	(0.067)
Intercept	-5.693***
	(1.408)
Num.Obs.	6151

But this is also somewhat unconventional and makes our regression output look a little bit lonely. A better option could be to display a coefficient plot that shows point estimates and confidence intervals. This is often preferable with one model because it makes our results so much easier to interpret.

Let’s try making a coefficient plot with modelplot() from the modelsummary package. The syntax for modelplot() is very similar to that of modelsummary() but there are a few small differences.

First, it puts maps the coefficients from the bottom up, so if you to maintain the order of the coefficients you need to reverse the mapping. We do this with the rev() function.

Second, we want to omit the intercept with coef_omit = "Intercept" because the emphasis in a coefficient plot is less on the exact regression equation and more on the magnitude and significance of the coefficients.

Third we can customize various things. We can specify the color of the points and confidence intervals and we can load ggplot2 for further customization. Now we can add geoms and labels just like any other ggplot object. Here we add a red vertical intercept line at zero to make it clearer which variables are significant (e.g. the confidence interval does not overlap with zero). And we add a title and a caption using the labs() function.

library(ggplot2)


modelplot(conflict_model, 
          coef_map = rev(coef_map), # rev() reverses list order
          coef_omit = "Intercept", 
          color = "blue") + # use plus to add customizations like any ggplot object
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", linewidth = .75) + # red 0 line
  labs(
    title = "Figure 1: Predictors of Conflict Onset",
    caption = "See appendix for data sources."
  )