Bar Charts and Histograms

April 23, 2024

Reading Data into R

Getting Started with Data


  • Tabular data is data that is organized into rows and columns
    • a.k.a. rectangular data
  • A data frame is a special kind of tabular data used in data science
  • A variable is something you can measure
  • An observation is a single unit or case in your data set
  • The unit of analysis is the level at which you are measuring
    • In a cross-section: country, state, county, city, individual, etc.
    • In a time-series: year, month, day, etc.
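As a minimal sketch of these terms, the toy data frame below has one row per observation and one column per variable. The values and the name toy are invented for illustration and do not come from any real data set; the unit of analysis here is the country-year.

```r
# Hypothetical toy data: values are invented for illustration only
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2020, 2021, 2020, 2021),
  score   = c(0.61, 0.64, 0.32, 0.30)
)

nrow(toy)   # number of observations (rows)
ncol(toy)   # number of variables (columns)
names(toy)  # variable names
```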

Example

Some Basic R Code


  • <- is the assignment operator
    • Use it to assign values to objects
  • # is the comment operator
    • Use it to comment out code or add comments
    • It works differently in Markdown text, where # marks a heading
  • To load a library, use library() with the name of the library
    • The name does not have to be in quotes, e.g. library(readr)
    • Quotes are required only when you install it, e.g. install.packages("readr")
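The points above can be tried in a few lines. This is only a sketch: the object names x and y are arbitrary, and it assumes readr is already installed (the install line is left commented out).

```r
# Assign the value 5 to an object called x
x <- 5

# Anything after # on a line is ignored by R
y <- x * 2  # y is now 10

# Installing needs quotes and is done once per machine:
# install.packages("readr")

# Loading is done in every session; quotes are optional here
library(readr)
```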

Read Data into R


# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")
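A relative path such as "data/dem_summary.csv" is resolved against the current working directory, which getwd() reports. If the course file is not at hand, the same call can be tried on a throwaway file. This is a minimal sketch, assuming readr is installed; the file contents and the object name toy are made up for illustration.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("region,score", "North,0.8", "South,0.4"), path)

# read_csv() guesses column types and returns a tibble (a kind of data frame)
toy <- read_csv(path, show_col_types = FALSE)

is.data.frame(toy)  # a tibble is still a data frame
dim(toy)            # rows, then columns
```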

Viewing the Data in R


Use glimpse() to see the columns and data types:

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)
Rows: 6
Columns: 5
$ region    <chr> "The West", "Latin America", "Eastern Europe", "Asia", "Afri…
$ polyarchy <dbl> 0.8709230, 0.6371358, 0.5387451, 0.4076602, 0.3934166, 0.245…
$ gdp_pc    <dbl> 37.913054, 9.610284, 12.176554, 9.746391, 4.410484, 21.134319
$ flfp      <dbl> 52.99082, 48.12645, 50.45894, 50.32171, 56.69530, 26.57872
$ women_rep <dbl> 28.12921, 21.32548, 17.99728, 14.45225, 17.44296, 10.21568

Or use View() or click on the name of the object in your Environment tab to see the data in a spreadsheet:

Try It Yourself!

  • Open the CSV file to see what it looks like
  • Then use this code to read it into R and view it
# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)

Bar Charts

The Grammar of Graphics

  • Data viz has a language with its own grammar
  • Basic components include:
    • Data we are trying to visualize
    • Aesthetics (dimensions)
    • Geom (e.g. bar, line, scatter plot)
    • Color scales
    • Themes
    • Annotations
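One way to see that these components are separate pieces is to build a plot object in steps and inspect it. The sketch below uses made-up data and assumes ggplot2 is installed. The layers element it inspects is an internal detail of the plot object, used here only to count geoms, not something you would normally touch.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 1, 2))

# Data + aesthetics only: axes but nothing drawn
p_axes <- ggplot(toy, aes(x = group, y = value))
length(p_axes$layers)  # no geom layers yet

# Adding a geom adds one layer
p_bars <- p_axes + geom_col()
length(p_bars$layers)

# Labels and themes modify the plot without adding layers
p_full <- p_bars + labs(x = "Group", y = "Value") + theme_minimal()
length(p_full$layers)
```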


Let’s start with the first two, the data and the aesthetic…


library(readr)
library(ggplot2)

dem_summary <- read_csv("data/dem_summary.csv")

ggplot(dem_summary, aes(x = region, y = polyarchy)) 

This gives us the axes without any visualization:


Now let’s add a geom. In this case we want a bar chart so we add geom_col().


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col()

That gets the idea across but looks a little depressing, so…


…let’s change the color of the bars by specifying fill = "steelblue".


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue")

Note how the color of the original bars is simply overwritten:
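For bars, fill sets the interior and color sets the outline. Both go outside aes() when they are constants, and inside aes() when they should vary with the data. A short sketch with made-up data, assuming ggplot2 is installed:

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 1, 2))

# Constant fill: every bar is steelblue with a black outline, no legend
p_const <- ggplot(toy, aes(x = group, y = value)) +
  geom_col(fill = "steelblue", color = "black")

# Mapped fill: fill goes inside aes(), so each group gets its own color
p_mapped <- ggplot(toy, aes(x = group, y = value, fill = group)) +
  geom_col()
```

Printing p_const and p_mapped side by side shows the difference: only the mapped version gets a legend.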


Now let’s add some labels with the labs() function:


ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue") +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

And that gives us…

Next, we reorder the bars with fct_reorder() from the forcats package.


library(forcats)

ggplot(dem_summary, aes(x = fct_reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )


Note that we could also use the base R reorder() function here.

This way, we get a nice, visually appealing ordering of the bars according to levels of democracy…
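What the reordering does is easiest to see on the factor levels themselves. In this sketch the vectors region and score are made up; the minus sign in reorder() sorts from highest to lowest, and fct_reorder() offers the .desc argument for the same purpose.

```r
library(forcats)

# Made-up data for illustration
region <- c("A", "B", "C")
score  <- c(0.2, 0.9, 0.5)

# Base R: order levels by increasing -score, i.e. decreasing score
levels(reorder(region, -score))
# "B" "C" "A"

# forcats: the same ordering, using .desc instead of a minus sign
levels(fct_reorder(region, score, .desc = TRUE))
# "B" "C" "A"
```

ggplot2 draws a discrete axis in level order, which is why reordering the factor reorders the bars.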


Now let’s change the theme to theme_minimal().


ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    ) + theme_minimal()

This gives us a clean, elegant look.


Note that you can also save your plot as an object to modify later.


dem_bar_chart <- ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue")

Which gives us…

dem_bar_chart


Now let’s add back our labels…


dem_bar_chart <- dem_bar_chart +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

So now we have…

dem_bar_chart


And now we’ll add back our theme…


dem_bar_chart <- dem_bar_chart + theme_minimal()

Voila!

dem_bar_chart

Change the theme. There are many themes to choose from.

dem_bar_chart + theme_bw()
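To compare a few themes quickly, the same saved plot can be combined with each one in turn. This is a sketch on made-up data, assuming ggplot2 is installed; the four themes listed are only a sample of those that ship with the package.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 1, 2))
base_plot <- ggplot(toy, aes(x = group, y = value)) + geom_col()

# A few of the complete themes that ship with ggplot2
themes <- list(
  bw      = theme_bw(),
  classic = theme_classic(),
  light   = theme_light(),
  dark    = theme_dark()
)

# One plot per theme; print any of them, e.g. plots$classic
plots <- lapply(themes, function(th) base_plot + th)
names(plots)
```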

Your Turn!

  1. glimpse() the data
  2. Find a new variable to visualize
  3. Make a bar chart with it
  4. Change the color of the bars
  5. Order the bars
  6. Add labels
  7. Add a theme
  8. Try saving your plot as an object
  9. Then change the labels and/or theme

Histograms

Purpose of Histograms

  • Histograms are used to visualize the distribution of a single variable
  • They are used for continuous variables (e.g., income, age, etc.)
  • A continuous variable is one that can take on any value within a range (e.g., 0.5, 1.2, 3.7, etc.)
  • A discrete variable is one that can only take on certain values (e.g., 1, 2, 3, etc.)
  • The x-axis represents the values of the variable of interest, grouped into bins
  • The y-axis represents the frequency, i.e. the number of observations in each bin
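The binning behind a histogram can be reproduced by hand. This sketch uses base R's hist() with plot = FALSE so that only the numbers are returned; the values in x and the two bin edges are made up for illustration.

```r
# Made-up values for illustration
x <- c(1, 2, 2, 3, 7, 8, 9, 9, 10)

# Two bins: 0 to 5 and 5 to 10; plot = FALSE returns the numbers only
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

h$breaks  # bin edges
h$counts  # 4 values fall in the first bin, 5 in the second
```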

Example

Histogram Code


# load libraries
library(readr)
library(dplyr)
library(ggplot2)

# load data
dem_women <- read_csv("data/dem_women.csv")

# filter to 2022
dem_women_2022 <- dem_women |>
  filter(year == 2022) 

# create histogram
ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Change Number of Bins


Change the number of bins (bars) with the bins or binwidth argument (the default is 30 bins):


ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(bins = 50, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

At 50 bins…

At 100 bins…probably too many!


Using binwidth instead of bins


ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(binwidth = 2, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Setting binwidth to 2…
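The two arguments control the same thing from different directions: bins fixes how many bars there are, binwidth fixes how wide each one is. A sketch for checking this on made-up data, assuming ggplot2 is installed; layer_data() returns the bars that ggplot2 computed for a layer.

```r
library(ggplot2)

# Made-up values spread evenly between 10 and 80
toy <- data.frame(x = seq(10, 80, by = 0.5))

p_bins  <- ggplot(toy, aes(x = x)) + geom_histogram(bins = 10)
p_width <- ggplot(toy, aes(x = x)) + geom_histogram(binwidth = 2)

# One row per bar, with xmin and xmax giving the bar edges
d_bins  <- layer_data(p_bins)
d_width <- layer_data(p_width)

nrow(d_bins)                         # number of bars when bins = 10
range(d_width$xmax - d_width$xmin)   # bar widths when binwidth = 2
```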

Change from Count to Density


ggplot(dem_women_2022, aes(x = flfp, y = after_stat(density))) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Density",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Which gives us…
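On the density scale, each bar's height is its count divided by the total count times the bin width, so the bar areas add up to 1. The sketch below checks that arithmetic with base R's hist() on made-up values; ggplot2 documents after_stat(density) as scaled to integrate to 1 in the same way.

```r
# Made-up values for illustration
x <- c(1, 2, 2, 3, 7, 8, 9, 9, 10)
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

widths <- diff(h$breaks)                       # width of each bin
dens   <- h$counts / (sum(h$counts) * widths)  # density computed by hand

all.equal(dens, h$density)  # matches what hist() reports
sum(dens * widths)          # bar areas add up to 1
```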

Your Turn!

  1. Pick a variable that you want to explore the distribution of
  2. Make a histogram
    1. Only specify x = in aes()
    2. Specify the geom as geom_histogram()
  3. Choose color for bars
  4. Choose appropriate labels
  5. Change number of bins
  6. Change from count to density