Module 2.2

Line Charts and Scatter Plots

Prework

Install the devtools package. Type install.packages("devtools") in your console. You will need this to install the vdemdata package because it is not on the CRAN Network.
Install the vdemdata package from GitHub. Type devtools::install_github("vdeminstitute/vdemdata") in your console.

In previous lessons, we have loaded the Tidyverse packages individually. However, it is also possible to load them by simply loading the tidyverse package. Let’s try doing it that way this time.

library(tidyverse)

Now we have all of the tidyverse packages we need including readr, dplyr, ggplot2 and scales.

We are also going to be using the vdemdata package to download data from the Varieties of Democracy project. We will load that package a little later, but be sure you have it installed before proceeding.

In this module we are going to learn how to make line charts and scatter plots. We will also learn how to add a additional dimensions to our visualizations by using color, size and linetype. For scatter plots, we will learn how to “decompose” our data by using facets. Finally, we will learn how to change the scale of our axes and how to label our points.

Line charts

Now let’s create a line chart. Line charts are usually the best option when we want to illustrate trends in our data. For this visualization, we will try to illustrate Samuel Huntington’s waves of democracy by showing how countries representing each of the three waves. The U.S. represents the first wave, Japan the second wave starting with the allied victory in WWII, and Portugal represents the first country to transition in the third wave.

First, let’s grab the relevant data using vdemdata and dplyr. We are going to be downloading the polyarchy measure for the U.S., Japan and Portugal as far back as the data are available. So first we will select country name, year and the polyarchy schore and then we will filter the data based on the three country names. We are saving these data in an object called dem_waves_ctrs.

library(vdemdata)

dem_waves_ctrs <- vdem |>
  select(
    country = country_name,     
    year, 
    polyarchy = v2x_polyarchy, 
  ) |>
  filter( 
    country %in% c("United States of America", # select countries in this list
                   "Japan", 
                   "Portugal")
    )

write_csv(dem_waves_ctrs, "data/dem_waves_ctrs.csv")

Next, we are going to do our ggplot() call. The data will be the dem_waves_ctrs object that we just created. You can should have the data saved but you can also download it here. For the aesthetics mapping, we will put the year on the x-axis and the polyarchy score on the y-axis. We will also specify color in the aes() call so that we can color the lines by region.

To get a line chart, we have to specify geom_line(). Then within the geom_line() function we will set the linewidth equal to `1’ so that the lines are a bit more visible.

Finally, we will add a labs() call as with the previous visualizations. But in addition to title, axis labels and a caption, we will also add color = "Country" to change the label of the legend to “Country” with a capital “C.”

# in this ggplot() call, we add a third dimension for line color
ggplot(dem_waves_ctrs, aes(x = year, y = polyarchy, color = country)) +
  geom_line(linewidth = 1) + # our geom is a line with a width of 1
  labs(
    x = "Year", 
    y = "Polyarchy Score", 
    title = 'Democracy in countries representing three different "waves"', 
    caption = "Source: V-Dem Institute", 
    color = "Country" # make title of legend to upper case
  )

Scatter plots

The last thing we are going to do in this lesson is to create a scatter plot. We use scatter plots in order to illustrate how two variables relate to each other (or not). In this example, we are going to illustrating modernization theory, which predicts a positive relationship between wealth and democracy, while also incorporating levels of women’s representation into our analysis.

We are going to start with the dem_women.csv file that you can find here. We will then group the data by country and calculate the mean for each variable. Note that in the group_by() call we also include region because we will want to keep it so that we can color our points by region.

dem_summary_ctry <- read_csv("data/dem_women.csv") |>
  group_by(country, region) |> # group by country, keep region
  summarize(
    polyarchy = mean(polyarchy, na.rm = TRUE),
    gdp_pc = mean(gdp_pc, na.rm = TRUE), 
    flfp = mean(flfp, na.rm = TRUE), 
    women_rep = mean(women_rep, na.rm = TRUE)
  )

Now let’s create our first scatter plot. Our ggplot() call looks similar to previous ones except for a few things. First we are calling geom_point() for our geom. But also notice that our aesthetics mapping includes four dimenstions: x, y, color and size. So here we are telling ggplot2 that we want wealth on the x-axis, the polyarchy score on the y-axis, to color the points based on region, and to vary the size of the points in relation to the level of women’s representation.

One last thing we want to do is to put our x-axis on a log scale and change the labels to reflect their dollar values. For the log scale, we can use the scale_x_log10() function and for the labels we can use the label_number() function from the scales package. We set the prefix to “$” and the suffix to “k” so that each number on the x-axis starts with a dollar sign and ends with “k” denoting “thousands.”

Note

We will encounter other useful scales functions including label_dollar() and label_percent() in future lessons.

Notice that in this example we introduce the scales package by including it as a prefix to the label_number() function, e.g. scales::label_number(prefix = "$", suffix = "k"). This allows us to use the package without having to load it, e.g. library(scales). It also has the benefit of generating a list of auto-complete suggestions for the many available functions in the scales package.

# in this ggplot() call we have four dimensions
# x, y, color, and size
ggplot(dem_summary_ctry, aes(x = gdp_pc, y = polyarchy, color = region, size = women_rep)) + 
  geom_point() + # use geom_point() for scatter plots
  scale_x_log10(labels = scales::label_number(prefix = "$", suffix = "k")) +
  labs(
    x= "GDP per Capita", 
    y = "Polyarchy Score",
    title = "Wealth and democracy, 1990 - present", 
    caption = "Source: V-Dem Institute", 
    color = "Region",
    size = "Women Reps"
    )

The plot does a good job of illustrating the basic point of modernization theory in that we do see the positive correlation between wealth and democracy. But we also see that there are some outliers and that a lot of the outlier countries are concentrated in the Middle East.

We also see that the distribution of women’s representation is somewhat orthogonal to wealth and democracy. Most wealthy western countries have high levels of women’s representation, but so do a lot of low- and middle-income countries in Africa, Asia and Latin America.

Adding a trend line

We can definitely see a relationship between wealth and democracy in the scatter plot, but how strong is it? One way to find out is to add a trend line. Let’s do this by adding another geom, geom_smooth(), and specifying a linear model with the argument method = "lm" We acn also set the linewidth of the trend line to 1 so that the line is more visible.

If we want to add a single trend while also maintaining the coloring by region, then we have to reconfigure the ggplot() call a bit. Specifically, we will want to move color = region to a separate aes() call in the geom_point() function, e.g. geom_point(aes(color = region)). If we don’t do this we will get separate trend lines for each region (try it and see!).

ggplot(dem_summary_ctry, aes(x = gdp_pc, y = polyarchy)) + 
  geom_point(aes(color = region)) + 
  geom_smooth(method = "lm", linewidth = 1) + 
  scale_x_log10(labels = scales::label_number(prefix = "$", suffix = "k")) +
  labs(
    x= "GDP per Capita", 
    y = "Polyarchy Score",
    title = "Wealth and democracy, 1990 - present", 
    caption = "Source: V-Dem Institute", 
    color = "Region"
    )

Facet wrapping

Now let’s imagine that we really interested in drilling down into the “heterogeneous effects” of wealth on democracy by region. In other words, we want to see more clearly how wealth is related to democracy in some regions but not others. For this, we can use facet_wrap() to get a separate chart for each region rather than just shading the points by region. Inside of facet_wrap() we identify region as the variable that we want to use to separate the plots, e.g. facet_wrap(~region). Notice how we have to include a tilde (~) here.

ggplot(dem_summary_ctry, aes(x = gdp_pc, y = polyarchy)) + 
  geom_point() + 
  geom_smooth(method = "lm", linewidth = 1) + 
  facet_wrap(~ region) +
  scale_x_log10(labels = scales::label_number(prefix = "$", suffix = "k")) +
  labs(
    x= "GDP per Capita", 
    y = "Polyarchy Score",
    title = "Wealth and democracy, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

Here we can clearly see a relationship between wealth and democracy in all of the countries except for the Middle East and Africa. We could speculate that the lack of a relationship in the Middle East could be evidence of an oil curse dynamic whereas perhaps the lack of a relationship in Africa is due to weak institutions.

The relationship between wealth and democracy in the West would be apparent, but it is obscured by the fact that western countries because the high wealth and polyarchy values result in extreme bunching in the northwest quadrant of the graph. To deal with this issue, we could add the scales = "free" argument to our plot.

ggplot(dem_summary_ctry, aes(x = gdp_pc, y = polyarchy)) + 
  geom_point() + 
  geom_smooth(method = "lm", linewidth = 1) + 
  facet_wrap(~ region, scales = "free") +
  scale_x_log10(labels = scales::label_number(prefix = "$", suffix = "k")) +
  labs(
    x= "GDP per Capita", 
    y = "Polyarchy Score",
    title = "Wealth and democracy, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

But notice there is a bit of a tradeoff here. With the scales = free option set, we now have separate axes for each of the plots. This is less of a clean look than having common x and y axes.

Labeling points

Now let’s try drilling down into one of the regions to get a better sense of what countries are driving the relationship. To do this, we can filter our data set for a region that we are interested in and then add country labels to the points in the scatter plot. Here we are going to filter for “Asia” and we will ad a geom_text() call to add country labels. In the geom_text() call we include arguments for size and vjust to adjust the size and vertical location of the labels relative to the points.

dem_summary_ctry |> 
  filter(region == "Asia") |>
  ggplot(aes(x = gdp_pc, y = polyarchy)) + 
    geom_point() + 
    geom_text(aes(label = country), size = 2, vjust = 2) +
    geom_smooth(method = "lm", linewidth = 1) +
    scale_x_log10(labels = scales::label_number(prefix = "$", suffix = "k")) +
      labs(
        x= "GDP Per Capita", 
        y = "Polyarchy Score",
        title = "Wealth and democracy in Asia, 1990 - present", 
        caption = "Source: V-Dem Institute"
        )