Summarizing Data

April 23, 2024

Group, Summarize and Arrange

group_by(), summarize(), arrange()
A very common sequence of dplyr verbs:
- Compute an average or another summary statistic for each group
- Rank the groups from high to low by that summary value
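
As a minimal sketch of this sequence, the example below runs the three verbs on a small made-up data frame. The object and column names (scores, group, score, avg_score) are hypothetical and chosen only for illustration; they are not part of the V-Dem data.

```r
library(dplyr)

# Hypothetical toy data: two groups with three scores each
scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(1, 2, 3, 7, 8, 9)
)

ranked <- scores |>
  group_by(group) |>                     # one group per distinct value of `group`
  summarize(avg_score = mean(score)) |>  # collapse to one row per group
  arrange(desc(avg_score))               # highest average first

ranked
```

Group "b" (average 8) is listed before group "a" (average 2) because of desc().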

Setup

# Load packages
library(vdemdata) # provides the V-Dem dataset (vdem)
library(dplyr)

# Load and wrangle the data
democracy <- vdem |> # start from the V-Dem dataset
  filter(year == 2015)  |> # filter year, keep 2015
  select(                  # select (and rename) these variables
    country = country_name,     # the name before the = sign is the new name  
    vdem_ctry_id = country_id,  # the name after the = sign is the old name
    year, 
    polyarchy = v2x_polyarchy,
    libdem = v2x_libdem,
    corruption = v2x_corr,
    gdp_pc = e_gdppc, 
    region = e_regionpol_6C
    ) |>
  mutate(
    region = case_match(region, # replace the numeric region codes with region names
                     1 ~ "Eastern Europe", 
                     2 ~ "Latin America",  
                     3 ~ "Middle East",   
                     4 ~ "Africa", 
                     5 ~ "The West", 
                     6 ~ "Asia")
  )

# View the data
glimpse(democracy)

Summarize by Region

# group_by(), summarize() and arrange()
dem_summary <- democracy |> # save result as new object
  group_by(region)  |> # group data by region
  summarize(           # summarize following vars (by region)
    polyarchy = mean(polyarchy, na.rm = TRUE), # calculate mean, remove NAs
    libdem = median(libdem, na.rm = TRUE),
    corruption = sd(corruption, na.rm = TRUE),
    gdp_pc = max(gdp_pc, na.rm = TRUE)
  ) |> 
  arrange(desc(polyarchy)) # arrange in descending order by polyarchy score

# Print the data
dem_summary

Summarize by Region

# A tibble: 6 × 5
  region         polyarchy libdem corruption gdp_pc
  <chr>              <dbl>  <dbl>      <dbl>  <dbl>
1 The West           0.876  0.824     0.0647   81.7
2 Latin America      0.648  0.476     0.281    30.8
3 Eastern Europe     0.548  0.419     0.292    31.7
4 Asia               0.443  0.312     0.263    64.8
5 Africa             0.435  0.261     0.231    30.6
6 Middle East        0.271  0.171     0.250    91.2

Use group_by() to group countries by region…
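
It may help to see that group_by() on its own does not change the rows; it only records the grouping for later verbs. The sketch below uses a hypothetical toy data frame (toy, with made-up regions and scores) rather than the V-Dem data.

```r
library(dplyr)

# Hypothetical toy data, not taken from V-Dem
toy <- data.frame(
  region = c("North", "North", "South"),
  score  = c(0.2, 0.4, 0.9)
)

grouped <- toy |> group_by(region)

nrow(grouped)        # still 3 rows: grouping does not collapse the data
group_vars(grouped)  # "region"
n_groups(grouped)    # 2
```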

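Use summarize() to calculate one summary statistic per variable within each region (here the mean of polyarchy, the median of libdem, the standard deviation of corruption and the maximum of gdp_pc)…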
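
The sketch below illustrates, on a hypothetical toy data frame, how summarize() returns one row per group and why na.rm = TRUE matters when a group contains missing values. The column names n_rows, mean_score and mean_with_na are made up for this example.

```r
library(dplyr)

# Hypothetical toy data with one missing score in the "North" group
toy <- data.frame(
  region = c("North", "North", "North", "South", "South"),
  score  = c(0.2, 0.4, NA, 0.9, 0.5)
)

toy_summary <- toy |>
  group_by(region) |>
  summarize(
    n_rows       = n(),                        # number of rows in each group
    mean_score   = mean(score, na.rm = TRUE),  # NAs dropped before averaging
    mean_with_na = mean(score)                 # NA if the group has any NA
  )

toy_summary
```

Here mean_with_na is NA for "North" because that group has a missing score, while mean_score is computed from the two non-missing values.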

Then use arrange() with desc() to sort in descending order by polyarchy score…
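
To make the difference between ascending and descending order concrete, here is a small sketch on a hypothetical summary table (toy_summary, with made-up values). By default arrange() sorts from low to high; wrapping the column in desc() reverses that.

```r
library(dplyr)

# Hypothetical summary table, not taken from V-Dem
toy_summary <- data.frame(
  region = c("North", "South", "East"),
  score  = c(0.3, 0.7, 0.5)
)

ascending  <- toy_summary |> arrange(score)        # low to high (default)
descending <- toy_summary |> arrange(desc(score))  # high to low

ascending$region   # "North" "East" "South"
descending$region  # "South" "East" "North"
```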

We are printing the data frame instead of using glimpse() here…

Some Common Arithmetic Functions

- sqrt(): square root
- log(): natural logarithm
- mean(): mean
- median(): median
- sd(): standard deviation
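
A quick way to get a feel for these functions is to call them on a small numeric vector. The vector x below is made up for illustration. Note that sd() in R returns the sample standard deviation, which divides by n - 1.

```r
# Hypothetical example vector
x <- c(1, 4, 9, 16, 100)

sqrt(x)    # 1 2 3 4 10
log(x)     # natural logarithm of each element; log(1) is 0
mean(x)    # 26
median(x)  # 9
sd(x)      # sample standard deviation (divides by n - 1)

log(exp(1))     # 1
sd(c(1, 2, 3))  # 1
```

The mean (26) is far above the median (9) because the single large value 100 pulls the average up.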

Try it Yourself

Try running a group_by(), summarize() and arrange() pipeline in your Quarto document
Try changing the parameters to answer these questions:

Try summarizing the data with a different function for one or more of the variables.

What is the median value of polyarchy for The West?
What is the max value of libdem for Eastern Europe?
What is the standard deviation of corruption for Africa?
What is the mean of gdp_pc for the Middle East?

Now try grouping by country instead of region.

What is the median value of polyarchy for Sweden?
What is the max value of libdem for New Zealand?
What is the standard deviation of corruption for Spain?
What is the interquartile range of gdp_pc for Germany?
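
One thing to watch for when grouping by country: if the data covers a single year, as it does after filter(year == 2015), each country is likely to form a group with exactly one row. The sketch below, on a hypothetical two-country data frame, shows how summary functions behave in that case. It assumes one observation per country and uses made-up names and values.

```r
library(dplyr)

# Hypothetical data: one row per country, as in a single-year extract
one_year <- data.frame(
  country = c("A", "B"),
  score   = c(0.9, 0.4)
)

by_country <- one_year |>
  group_by(country) |>
  summarize(
    median_score = median(score),  # equals the single value
    sd_score     = sd(score),      # NA: sd needs at least two values
    iqr_score    = IQR(score)      # 0: no spread with one value
  )

by_country
```

Keeping more than one year in the filter() step would give each country several observations, and measures of spread would then become informative.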

Sort countries in descending order based on the mean value of gdp_pc (instead of the mean value of polyarchy). Which country ranks first based on this sorting?
Now try sorting countries in ascending order based on the median value of libdem (hint: delete “desc” from the arrange() call). Which country ranks at the “top” of the list?

Visualize It!

library(ggplot2)

ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 2015", 
    caption = "Source: V-Dem Institute"
    ) + theme_minimal()

Try it Yourself

Run the code and create a bar chart with the dem_summary data you wrangled, grouping by region again (instead of country)
Try visualizing different variables, e.g. libdem, corruption, gdp_pc
Try different summary statistics, e.g. mean, median, standard deviation, etc.
Try grouping by country instead of region and visualizing that
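
As a starting point for these variations, the sketch below plots a different column from a hypothetical summary table. The data frame toy_summary and its values are made up. With the real data you would pass dem_summary and one of its columns instead. Both the reorder() call and the y aesthetic need to name the variable you want to plot.

```r
library(ggplot2)

# Hypothetical summary table, not taken from V-Dem
toy_summary <- data.frame(
  region     = c("North", "South", "East"),
  corruption = c(0.10, 0.35, 0.22)
)

# Bars ordered from highest to lowest value of the plotted variable
p <- ggplot(toy_summary, aes(x = reorder(region, -corruption), y = corruption)) +
  geom_col(fill = "steelblue") +
  labs(
    x = "Region",
    y = "Corruption (toy values)",
    title = "Toy example"
  ) +
  theme_minimal()

p
```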
