Module 2.1

Working with and Summarizing Data

Prework

Create a Quarto document named module-2.1.qmd in your modules project folder so that you can code along with the lesson.

Overview

In this module we learn how to download, wrangle, and summarize data. We will work with the V-Dem Dataset, a rich source of democracy-related indicators covering countries from 1789 to the present. Along the way we cover the core dplyr verbs for data manipulation: filter(), select(), mutate(), group_by(), summarize(), and arrange().

Downloading and Transforming Data

The vdemdata package gives us direct access to the V-Dem dataset in R. Its main object, vdem, contains the entire dataset: over 4,000 variables and nearly 30,000 country-year observations. Because vdem loads everything at once, we rely on dplyr functions to narrow it down to just what we need.

Note

The vdemdata package includes a variable look-up function, find_var(), but for details on each indicator it is best to consult the V-Dem codebook directly.
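As a rough sketch of how the look-up mentioned in the note might be used, the snippet below searches for a keyword. It assumes the vdemdata package is installed and that find_var() accepts a single keyword string; the keyword itself is only an illustration, and the exact shape of the result may differ between package versions.

```r
# Illustrative search term (an assumption; any keyword of interest works)
keyword <- "polyarchy"

# Only run the search when vdemdata is available on this machine
if (requireNamespace("vdemdata", quietly = TRUE)) {
  # Look for indicators whose codebook information mentions the keyword
  matches <- vdemdata::find_var(keyword)
  print(head(matches))
}
```

The search helps locate candidate variable codes; the codebook remains the place to check how each indicator is defined and scaled.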

We will focus on four variables: the polyarchy score (v2x_polyarchy), GDP per capita (e_gdppc), region (e_regionpol_6C), and the women’s political empowerment index (v2x_gender). We also keep country_name, country_id, and year for identification.

The filter() verb keeps only the rows we want. The select() verb keeps only the columns we want and lets us rename them at the same time, using the pattern new_name = old_name. Finally, mutate() with case_match() recodes the numeric region codes into readable labels.

library(tidyverse)
library(vdemdata)

democracy <- vdem |>
  # keep only observations from 1990 onward
  filter(year >= 1990) |>
  # keep and rename the columns we need (new_name = old_name)
  select(
    country = country_name,
    vdem_ctry_id = country_id,
    year,
    polyarchy = v2x_polyarchy,
    gdp_pc = e_gdppc,
    women_emp = v2x_gender,
    region = e_regionpol_6C
  ) |>
  # recode the numeric region codes into short text labels
  mutate(
    region = case_match(region,
      1 ~ "Eastern Europe",
      2 ~ "Latin America",
      3 ~ "Middle East",
      4 ~ "Africa",
      5 ~ "The West",
      6 ~ "Asia")
  )

glimpse(democracy)

Rows: 6,383
Columns: 7
$ country      <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexico…
$ vdem_ctry_id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ year         <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 199…
$ polyarchy    <dbl> 0.388, 0.410, 0.438, 0.455, 0.467, 0.491, 0.512, 0.555, 0…
$ gdp_pc       <dbl> 24.396, 25.077, 25.561, 25.967, 26.101, 25.434, 25.851, 2…
$ women_emp    <dbl> 0.528, 0.499, 0.499, 0.509, 0.635, 0.631, 0.631, 0.631, 0…
$ region       <chr> "Latin America", "Latin America", "Latin America", "Latin…
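To see what each verb does in isolation, it can help to run the same pipeline on a tiny data frame where every row is visible. The sketch below uses a made-up four-row tibble (the object names toy and toy_clean and all of the values are hypothetical, not real V-Dem data) and assumes dplyr 1.1.0 or later, which is where case_match() was introduced.

```r
library(dplyr)

# Hypothetical toy data that mimics a few V-Dem column names
toy <- tibble(
  country_name   = c("A", "A", "B", "B"),
  year           = c(1985, 1995, 1985, 1995),
  v2x_polyarchy  = c(0.2, 0.4, 0.7, 0.8),
  e_regionpol_6C = c(2, 2, 5, 5)
)

toy_clean <- toy |>
  # filter() drops the two 1985 rows
  filter(year >= 1990) |>
  # select() keeps four columns and renames three of them
  select(
    country = country_name,
    year,
    polyarchy = v2x_polyarchy,
    region = e_regionpol_6C
  ) |>
  # case_match() turns the numeric codes into labels
  mutate(
    region = case_match(region,
      2 ~ "Latin America",
      5 ~ "The West")
  )

toy_clean
```

Any code that is not listed in case_match() becomes NA unless a .default value is supplied, so it is worth checking the recoded column for unexpected missing values.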

Your Turn

Change the start year from 1990 to a year of your choosing.
Add one more V-Dem variable to the select() call. Look up a variable code in the codebook that interests you and give it a clean name.
Run glimpse() on your new data frame. How many rows and columns does it have?

Summarizing Data

A common workflow in data science is group → summarize → arrange. We group the data by a categorical variable, calculate summary statistics for each group, and then sort the results.

We will illustrate this with dem_women.csv, a pre-processed dataset combining V-Dem democracy indicators with economic and gender representation variables.

dem_women <- read_csv("data/dem_women.csv")

glimpse(dem_women)

Rows: 5,846
Columns: 9
$ country      <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexico…
$ vdem_ctry_id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ iso3c        <chr> "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "MEX", "…
$ year         <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 199…
$ polyarchy    <dbl> 0.396, 0.418, 0.441, 0.451, 0.472, 0.489, 0.511, 0.560, 0…
$ gdp_pc       <dbl> 11.389, 11.635, 11.883, 11.983, 12.043, 11.742, 12.059, 1…
$ region       <chr> "Latin America", "Latin America", "Latin America", "Latin…
$ women_rep    <dbl> NA, NA, NA, NA, NA, NA, NA, 14.20, 17.40, 18.20, 16.00, 1…
$ flfp         <dbl> 33.94, 34.24, 35.01, 35.85, 36.38, 37.62, 37.69, 39.65, 3…

Now let’s calculate regional averages for four variables (polyarchy, gdp_pc, flfp, and women_rep) and sort the regions by polyarchy score in descending order:

dem_summary <- dem_women |>
  # one group per region
  group_by(region) |>
  # one row per region, holding the mean of each variable
  summarize(
    polyarchy   = mean(polyarchy, na.rm = TRUE),
    gdp_pc      = mean(gdp_pc, na.rm = TRUE),
    flfp        = mean(flfp, na.rm = TRUE),
    women_rep   = mean(women_rep, na.rm = TRUE)
  ) |>
  # highest average polyarchy score first
  arrange(desc(polyarchy))

dem_summary

# A tibble: 6 × 5
  region         polyarchy gdp_pc  flfp women_rep
  <chr>              <dbl>  <dbl> <dbl>     <dbl>
1 The West           0.871  37.9   53.0      28.1
2 Latin America      0.637   9.61  48.1      21.3
3 Eastern Europe     0.539  12.2   50.5      18.0
4 Asia               0.408   9.75  50.3      14.5
5 Africa             0.393   4.41  56.7      17.4
6 Middle East        0.246  21.1   26.6      10.2

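group_by() splits the data into groups before the summarize step. summarize() collapses each group down to one row. Setting na.rm = TRUE drops missing values before averaging; without it, any group that contains an NA (as women_rep does) would return NA for that mean. arrange() sorts the result, and desc() reverses the default ascending order.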
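The group, summarize, arrange pattern is easy to verify by hand on a small example. The sketch below uses a made-up five-row tibble (the names toy, score, and toy_summary and all of the values are hypothetical) with one missing value, so the effect of na.rm = TRUE is visible.

```r
library(dplyr)

# Hypothetical toy data: two regions, one missing score
toy <- tibble(
  region = c("North", "North", "South", "South", "South"),
  score  = c(1, 0.5, 0.25, NA, 0.5)
)

toy_summary <- toy |>
  group_by(region) |>
  summarize(
    # average of the non-missing scores in each region
    mean_score = mean(score, na.rm = TRUE),
    # number of non-missing scores behind each average
    n_obs = sum(!is.na(score))
  ) |>
  # largest average first
  arrange(desc(mean_score))

toy_summary
```

North averages 0.75 and South averages 0.375, so North is listed first. Counting the non-missing observations alongside each mean is a useful habit, because it shows how much data each average rests on.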

We can export the summary for later use:

write_csv(dem_summary, "data/dem_summary.csv")
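write_csv() expects the target folder to exist already. The sketch below shows one way to create the folder first and then confirm the file reads back correctly. It writes to a temporary directory so nothing in the project is touched; the object names (toy_summary, out_dir, out_path, reloaded) and the values are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical summary table standing in for dem_summary
toy_summary <- tibble(
  region    = c("North", "South"),
  polyarchy = c(0.75, 0.375)
)

# Create the output folder if it does not exist yet
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write the table, then read it back to check the round trip
out_path <- file.path(out_dir, "toy_summary.csv")
write_csv(toy_summary, out_path)
reloaded <- read_csv(out_path, show_col_types = FALSE)

reloaded
```

In a real project the same pattern would apply with "data" in place of the temporary directory.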

Your Turn

Summarize the dem_women data by region using median() instead of mean(). Do the rankings change?
Add a new summary statistic: the standard deviation of polyarchy using sd(). Which region has the most variation in democracy scores?
Try arrange(polyarchy) (without desc()). What changes?