Numerical Data

April 23, 2024

Electoral Democracy Measure


  • To what extent is the ideal of electoral democracy in its fullest sense achieved?
  • Measure runs from 0 (lowest) to 1 (highest)
  • 0.5 is a cutoff for distinguishing electoral democracy from electoral autocracy

The electoral principle of democracy seeks to embody the core value of making rulers responsive to citizens, achieved through electoral competition for the electorate’s approval under circumstances when suffrage is extensive; political and civil society organizations can operate freely; elections are clean and not marred by fraud or systematic irregularities; and elections affect the composition of the chief executive of the country. In between elections, there is freedom of expression and an independent media capable of presenting alternative views on matters of political relevance. – V-Dem Codebook

Other High-Level V-Dem Measures


  • Liberal Democracy
  • Egalitarian Democracy
  • Participatory Democracy
  • Deliberative Democracy

All continuous measures, ranging from 0 to 1. Let’s take a look at how to summarize data like this!

Data Setup


# Load packages 
library(vdemdata)
library(tidyverse)

# Create dataset for year 2022, with country name, year, electoral democracy, and region
vdem2022 <- vdem |>
  filter(year == 2022)  |>
  select(
    country = country_name, 
    year, 
    polyarchy = v2x_polyarchy, 
    region = e_regionpol_6C 
    ) |>
  mutate(region = case_match(region, 
                        1 ~ "Eastern Europe", 
                        2 ~ "Latin America",  
                        3 ~ "Middle East",   
                        4 ~ "Africa", 
                        5 ~ "The West", 
                        6 ~ "Asia")) 
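
As a quick check on the recoding step, the sketch below applies case_match() to a small made-up vector of region codes (the vector and its values are illustrative, not taken from V-Dem). Codes without a matching rule come back as NA, so it can be worth counting missing values after recoding. This assumes dplyr 1.1.0 or later, where case_match() is available.

```r
library(dplyr)

# Hypothetical region codes; 9 is deliberately not covered by any rule
codes <- c(1, 5, 9)

labels <- case_match(codes,
                     1 ~ "Eastern Europe",
                     5 ~ "The West")

labels

# Count the codes that no rule matched
sum(is.na(labels))
```

Running the same sum(is.na(...)) check on the region column of vdem2022 would show whether any country was left without a region label.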

Examine the Data


glimpse(vdem2022)
Rows: 179
Columns: 4
$ country   <chr> "Mexico", "Suriname", "Sweden", "Switzerland", "Ghana", "Sou…
$ year      <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, …
$ polyarchy <dbl> 0.598, 0.770, 0.899, 0.898, 0.633, 0.692, 0.833, 0.093, 0.20…
$ region    <chr> "Latin America", "Latin America", "The West", "The West", "A…


How can we summarize measures of democracy? 🤔


We could calculate the mean.

vdem2022 |>
  summarize(mean_democracy = mean(polyarchy))
  mean_democracy
1          0.497

The mean is the average of the values. It is a common measure of central tendency, but it is sensitive to outliers.


How can we summarize measures of democracy? 🤔


We could calculate the median.

vdem2022 |>
  summarize(median_democracy = median(polyarchy))
  median_democracy
1            0.501

The median is the value that separates the higher half from the lower half of the data.
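
To see how the two statistics react to an extreme value, here is a minimal sketch with made-up numbers (not V-Dem scores). Replacing one value with a much larger one moves the mean a lot, while the median stays where it was.

```r
# Made-up values for illustration
values <- c(2, 3, 4, 5, 6)
mean(values)    # 4
median(values)  # 4

# Same data, but the largest value is replaced by an extreme one
values_outlier <- c(2, 3, 4, 5, 60)
mean(values_outlier)    # 14.8, pulled upward by the extreme value
median(values_outlier)  # 4, unchanged
```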


We can also describe the shape of the distribution…

  • symmetric (e.g. normal)
  • right-skewed
  • left-skewed
  • unimodal (one peak)
  • bimodal (two peaks) or multimodal (more than two peaks)

Histograms

  • Used to represent the distribution of a continuous variable
  • The x-axis represents the range of values
  • The y-axis represents the frequency: how many observations fall in each range of values
  • The bars represent the number of observations in each range or “bin”
  • The shape of the histogram can tell us a lot about the distribution of the data
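
The binning idea can be sketched in base R without drawing anything. The values below are made up, and the bin width of 0.25 is chosen only to keep the example small; the exact bin edges that geom_histogram() uses may differ from these.

```r
# Made-up values between 0 and 1
x <- c(0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.90, 0.95)

# Four bins of width 0.25 covering the range 0 to 1
breaks <- seq(0, 1, by = 0.25)

# plot = FALSE returns the bin counts instead of drawing the histogram
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts       # number of observations in each bin: 2 2 2 3
sum(h$counts)  # equals the number of observations
```

Each count is the height of one bar, and the counts always add up to the number of observations.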

Symmetric Distributions

Skewed Distributions

Bimodal Distribution

When is the Mean Useful?


  • The mean works well as a summary statistic when the distribution is relatively symmetric
  • It works less well when distributions are skewed or bimodal (or multimodal)
  • With skewed distributions, the mean is pulled toward extreme values
  • The median is more robust to extreme values
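
The bimodal case can be illustrated with a small made-up example: when the values cluster in two groups, the mean lands in the gap between them, where no observation actually sits.

```r
# Made-up bimodal data: one cluster near 0.15, another near 0.85
x <- c(0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9)

mean(x)  # about 0.5, in the gap between the two clusters

# How many observations lie within 0.25 of the mean?
sum(abs(x - mean(x)) < 0.25)  # 0
```

Here the mean describes a "typical" value that does not occur in the data, which is why looking at the distribution first matters.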

Lesson

  • Always look at your data!!
  • When reading a study or watching a presentation, ask yourself:
    • Does the mean make sense given the distribution of the measure?
    • Could extreme values in a skewed distribution make the mean not as useful?
    • Have the analysts shown you the distribution? If not, ask about it!


Visualize Our Measure


mn <- mean(vdem2022$polyarchy)
med <- median(vdem2022$polyarchy)

ggplot(vdem2022, aes(x = polyarchy )) +
  geom_histogram(binwidth = .05, fill = "steelblue") +
   labs(
    x = "Electoral Democracy", 
    y = "Frequency", 
    title = "Distribution of Electoral Democracy in 2022", 
    caption = "Source: V-Dem Institute"
  ) + 
  geom_vline(xintercept = mn, linewidth = 1, color = "darkorange") +
  theme_minimal() 

Your Turn!

  • Look at the V-Dem codebook
  • Select a different high-level measure of democracy
  • Preprocess your data to include that measure in your data frame
  • Calculate the mean and median and store them as variables
  • Visualize the distribution of the measure
  • Include a vertical line for the mean
  • Now try the median

Recap


  • We can use statistics like mean or median to describe the center of a variable
  • We can visualize the entire distribution to characterize its shape
  • We should also say something about the spread of the distribution

Why Measure and Visualize Spread?

Measures of Spread: Range


  • Range (min and max values)
  • Not ideal because it does not tell us much about where most of the values are located
vdem2022 |>
  summarize(min = min(polyarchy),
            max = max(polyarchy))
    min   max
1 0.016 0.916

Measure of Spread: Interquartile Range

IQR: 75th percentile (Q3) minus 25th percentile (Q1)

Interquartile Range

  • The middle 50 percent of the countries in the data lie between 0.262 and 0.747
  • The IQR (0.485) is the difference between the Q3 and Q1 values
vdem2022 |>
  summarize(IQRlow =  quantile(polyarchy, .25),
            IQRhigh = quantile(polyarchy, .75),
            IQRlength = IQR(polyarchy)
          )
  IQRlow IQRhigh IQRlength
1  0.262   0.747     0.485
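
A small sketch of why the IQR can say more than the range. The two made-up vectors below have the same minimum and maximum, but one is concentrated in the middle and the other is piled up at the extremes.

```r
# Made-up data: same range, very different concentration
a <- c(0, 5, 5, 5, 5, 5, 10)   # most values in the middle
b <- c(0, 0, 0, 5, 10, 10, 10) # most values at the extremes

range(a)  # 0 10
range(b)  # 0 10

IQR(a)    # 0: the middle half of the values are all equal
IQR(b)    # 10: the middle half spans the whole range

# IQR() is the 75th percentile minus the 25th percentile
iqr_b_by_hand <- unname(quantile(b, 0.75) - quantile(b, 0.25))
iqr_b_by_hand
```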

Box Plot

  • A box plot is a graphical representation of the distribution based on the median and quartiles
  • It is a standardized way of displaying the distribution of data based on a five number summary: minimum, first quartile, median, third quartile, and maximum
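
The five number summary can be computed directly in base R with fivenum(). The sketch below uses a made-up vector; note that fivenum() uses Tukey's hinges, which can differ slightly from the quartiles returned by quantile() for other data, although they coincide here.

```r
# Made-up data: the integers 0 through 10
x <- 0:10

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five  # 0.0 2.5 5.0 7.5 10.0

# For comparison: quartiles from quantile()
quantile(x, c(0.25, 0.50, 0.75))
```

One caveat when reading a plot from geom_boxplot(): by default its whiskers extend only to the most extreme points within 1.5 times the IQR of the box, and points beyond that are drawn individually, so the whisker ends are not always the minimum and maximum.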

ggplot(vdem2022, aes(x = "", y = polyarchy)) +
  geom_boxplot(fill = "steelblue") + 
   labs(
    x = "", 
    y = "Electoral Democracy", 
    title = "Distribution of Electoral Democracy in 2022", 
    caption = "Source: V-Dem Institute"
  ) +
  theme_minimal()

Measure of Spread: Standard Deviation


  • Can think of it as something like the “average distance” of each data point from the mean
vdem2022 |>
  summarize(mean = mean(polyarchy),
            stdDev = sd(polyarchy))
   mean   stdDev
1 0.497 0.259951

Standard Deviation


  • A low standard deviation indicates that the values tend to be close to the mean
  • A high standard deviation indicates that the values are spread out over a wider range
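
As a minimal illustration with made-up numbers, the two vectors below share the same mean but have very different standard deviations.

```r
# Made-up data with the same mean (5) but different spread
tight <- c(4, 5, 6)
wide  <- c(0, 5, 10)

mean(tight)  # 5
mean(wide)   # 5

sd(tight)    # 1: values sit close to the mean
sd(wide)     # 5: values are spread over a wider range
```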

Starting with Variance


  • Variance is a step towards calculating the standard deviation.
  • It quantifies the average squared deviation of each number from the mean of the data set.

Calculating Deviation from the Mean

  • First, calculate the mean (\(\bar{X}\)) of the dataset.
  • For each data point (\(X_i\)), calculate its deviation from the mean: \[e_i = X_i - \bar{X}\]
    • Example with a mean of 5:
      • For a data point where \(X_i = 0\): \(e_i = 0 - 5 = -5\)
      • For a data point where \(X_i = 10\): \(e_i = 10 - 5 = 5\)

Squaring the Deviations

  • Square each deviation (\(e_i\)) to eliminate negative values: \[e_i^2 = (X_i - \bar{X})^2\]
  • Sum all squared deviations: \[\sum_{i=1}^{n} (X_i - \bar{X})^2\]
  • This sum represents the total squared deviation from the mean.

Calculating the Variance

  • Divide the total squared deviation by \((n-1)\) to obtain the sample variance: \[\text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]
  • Using \((n-1)\) ensures an unbiased estimate of the population variance when calculating from a sample.
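
The difference between dividing by \(n\) and by \((n-1)\) can be checked on a small made-up vector. The sketch below also confirms that R's built-in var() uses the \((n-1)\) version.

```r
# Made-up data: the integers 0 through 10
x <- 0:10
n <- length(x)

# Total squared deviation from the mean
ss <- sum((x - mean(x))^2)
ss            # 110

ss / n        # 10: dividing by n
ss / (n - 1)  # 11: dividing by n - 1 (sample variance)

var(x)        # 11: matches the n - 1 version
```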

Deriving the Standard Deviation


  • The standard deviation is the square root of the variance: \[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}\]
  • Taking the square root converts the variance back to the units of the original data.

Standard Deviation Simple Example


x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
e <- x - mean(x)
e
 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5

Standard Deviation Simple Example


x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
e_squared <- e^2
e_squared
 [1] 25 16  9  4  1  0  1  4  9 16 25

Standard Deviation Simple Example


x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sum_e_squared <- sum(e_squared)
sum_e_squared
[1] 110

Standard Deviation Simple Example


variance <- sum_e_squared/(length(x)-1)
variance
[1] 11

Standard Deviation Simple Example


standard_dev <- sqrt(variance)
standard_dev
[1] 3.316625
sd(x)
[1] 3.316625

Your Turn!


  • Calculate measures of center and spread for the polyarchy variable in the V-Dem data (mean, median, IQR, standard deviation)
  • How would you interpret these measures?
  • Try a box plot for the polyarchy variable
  • Try another variable in the V-Dem data
  • How does it compare to polyarchy?

Calculating Statistics by Groups


  • What if we want to describe electoral democracy and see how it differs across another variable, such as world region or year?
  • In this case we want to combine numerical summaries with categorical variables
  • This brings us back to bar charts

Calculating Statistics by Groups

  • Let’s calculate the mean and median of electoral democracy in each world region
  • For this, we add group_by() to our previous code
vdem2022 |>
  group_by(region) |>
  summarize(mean_dem = mean(polyarchy),
            median_dem = median(polyarchy))
# A tibble: 6 × 3
  region         mean_dem median_dem
  <chr>             <dbl>      <dbl>
1 Africa            0.403      0.371
2 Asia              0.424      0.428
3 Eastern Europe    0.533      0.558
4 Latin America     0.605      0.678
5 Middle East       0.235      0.213
6 The West          0.854      0.857

Calculating Statistics by Groups

  • Let’s store our statistics as a new data object, democracy_region
democracy_region <- vdem2022 |> 
  group_by(region) |>
  summarize(mean_dem = mean(polyarchy),
            median_dem = median(polyarchy))

democracy_region
# A tibble: 6 × 3
  region         mean_dem median_dem
  <chr>             <dbl>      <dbl>
1 Africa            0.403      0.371
2 Asia              0.424      0.428
3 Eastern Europe    0.533      0.558
4 Latin America     0.605      0.678
5 Middle East       0.235      0.213
6 The West          0.854      0.857
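
To make the grouping step easier to follow, here is a minimal sketch on a tiny made-up data frame (the region labels "A" and "B" and the scores are hypothetical). It also adds n() to count the observations behind each summary, which is an addition for illustration rather than part of the code above.

```r
library(dplyr)

# Made-up data: two groups with a handful of scores each
toy <- tibble(
  region = c("A", "A", "A", "B", "B"),
  score  = c(0.2, 0.4, 0.9, 0.6, 0.8)
)

toy_summary <- toy |>
  group_by(region) |>                    # split the rows by region
  summarize(mean_score = mean(score),    # one mean per region
            median_score = median(score),
            n = n())                     # number of rows per region

toy_summary
```

Group A has a mean of 0.5 but a median of 0.4 because one high score pulls the mean up; with only a few observations per group, that kind of gap is common.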

Visualize using our Bar Chart Skills

ggplot(democracy_region, aes(x = reorder(region, -mean_dem), y = mean_dem)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Mean Polyarchy Score", 
    title = "Democracy by region, 2022", 
    caption = "Source: V-Dem Institute"
    ) + 
  theme_minimal()

Numerical Variable by Group

How should we interpret this plot?

library(ggridges)

# One density curve per region, stacked along the y-axis
ggplot(vdem2022, aes(x = polyarchy, y = region, fill = region)) +
  geom_density_ridges() +
  labs(
    x = "Electoral Democracy",
    y = "Region",
    title = "A Ridge Plot",
    caption = "Source: V-Dem Institute"
  ) +
  scale_fill_viridis_d() +
  theme_minimal()

Your Turn!


  • Make a bar chart summarizing polyarchy or some other V-Dem variable
  • Now try your hand at a ridge plot