Bar Charts and Histograms

April 23, 2024

Reading Data into R

Getting Started with Data

Tabular data is data that is organized into rows and columns
- a.k.a. rectangular data
A data frame is a special kind of tabular data used in data science
A variable is something you can measure
An observation is a single unit or case in your data set
The unit of analysis is the level at which you are measuring
- In a cross-section: country, state, county, city, individual, etc.
- In a time-series: year, month, day, etc.
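
To make these terms concrete, here is a minimal sketch of a small cross-sectional data frame. The object name countries and every value in it are made up for illustration; they are not part of the course data.

```r
# Hypothetical values, made up purely for illustration
countries <- data.frame(
  country    = c("A", "B", "C"),
  year       = c(2020, 2020, 2020),
  population = c(5.1, 12.3, 0.8)
)

nrow(countries)   # number of observations (one row per country)
ncol(countries)   # number of variables (one column per variable)
names(countries)  # the variable names
```

Here the unit of analysis is the country: each row is one country observed in a single year.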

Example

Some Basic R Code

<- is the assignment operator
- Use it to assign values to objects
# is the comment character
- Use it to comment out code or add comments
- In Markdown text, # marks a heading instead
To load a library, use library() with the name of the library
- The name does not have to be in quotes, e.g. library(readr)
- Quotes are required only when you install it, e.g. install.packages("readr")
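
A short sketch of these three ideas together. The object name answer is hypothetical, and the stats package is used only because it ships with R, so the example runs without installing anything.

```r
# Lines that start with # are comments; R ignores them when running code
answer <- 6 * 7   # assign the result of 6 * 7 to an object named answer
answer            # typing the object's name prints its value

# Installing needs quotes; this line is commented out so it does not run
# install.packages("readr")

# Loading works with or without quotes
library(stats)
library("stats")
```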

Read Data into R

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

Viewing the Data in R

Use glimpse() to see the columns and data types:

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)

Rows: 6
Columns: 5
$ region    <chr> "The West", "Latin America", "Eastern Europe", "Asia", "Afri…
$ polyarchy <dbl> 0.8709230, 0.6371358, 0.5387451, 0.4076602, 0.3934166, 0.245…
$ gdp_pc    <dbl> 37.913054, 9.610284, 12.176554, 9.746391, 4.410484, 21.134319
$ flfp      <dbl> 52.99082, 48.12645, 50.45894, 50.32171, 56.69530, 26.57872
$ women_rep <dbl> 28.12921, 21.32548, 17.99728, 14.45225, 17.44296, 10.21568

Or use View(), or click on the name of the object in your Environment tab, to see the data in a spreadsheet-style viewer.

Try It Yourself!

Open the CSV file to see what it looks like
Then use this code to read it into R and view it

# load libraries
library(readr)
library(dplyr)

dem_summary <- read_csv("data/dem_summary.csv")

glimpse(dem_summary)
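
If you do not have the course file at hand, the same steps can be tried on a throwaway file. This is only a sketch: the file is written to a temporary location, and the column names and values are invented for illustration.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("region,score", "North,0.75", "South,0.5"), csv_path)

# A CSV is plain text: a header row, then one line per observation
readLines(csv_path)

# read_csv() turns that text into a data frame with typed columns
toy <- read_csv(csv_path, show_col_types = FALSE)
toy
```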


Bar Charts

The Grammar of Graphics

Data viz has a language with its own grammar
Basic components include:
- Data we are trying to visualize
- Aesthetics (dimensions)
- Geom (e.g. bar, line, scatter plot)
- Color scales
- Themes
- Annotations
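
As a rough sketch, each component corresponds to one piece of a ggplot2 call. The data frame scores and its values are hypothetical, and the particular scale, theme, and annotation are just one possible choice for each slot.

```r
library(ggplot2)

# Hypothetical scores, made up for illustration only
scores <- data.frame(
  group = c("A", "B", "C"),
  score = c(0.2, 0.5, 0.8)
)

p <- ggplot(scores, aes(x = group, y = score, fill = score)) +  # data + aesthetics
  geom_col() +                                                   # geom
  scale_fill_gradient(low = "grey80", high = "steelblue") +      # color scale
  theme_minimal() +                                              # theme
  annotate("text", x = 3, y = 0.85, label = "Highest")           # annotation

p
```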

Let’s start with the first two, the data and the aesthetic…

library(readr)
library(ggplot2)

dem_summary <- read_csv("data/dem_summary.csv")

ggplot(dem_summary, aes(x = region, y = polyarchy))

This gives us the axes without any data drawn on them.

Now let’s add a geom. In this case we want a bar chart so we add geom_col().

ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col()

That gets the idea across but looks a little depressing, so…

…let’s change the color of the bars by specifying fill = "steelblue".

ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue")

Note how the default color of the bars is simply overwritten.
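
A common point of confusion is where fill goes. The sketch below, using a hypothetical data frame scores with made-up values, contrasts a constant color set outside aes() with a color mapped to a variable inside aes().

```r
library(ggplot2)

# Hypothetical scores, made up for illustration only
scores <- data.frame(
  group = c("A", "B", "C"),
  score = c(0.2, 0.5, 0.8)
)

# Constant: every bar gets the same color, and no legend is drawn
p_constant <- ggplot(scores, aes(x = group, y = score)) +
  geom_col(fill = "steelblue")

# Mapped: the color depends on the group variable, and a legend is added
p_mapped <- ggplot(scores, aes(x = group, y = score)) +
  geom_col(aes(fill = group))
```

Putting fill = "steelblue" inside aes() would treat the text "steelblue" as a data value rather than a color, which is rarely what you want.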

Now let’s add some labels with the labs() function:

ggplot(dem_summary, aes(x = region, y = polyarchy)) + 
  geom_col(fill = "steelblue") +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

And that gives us…

Next, we reorder the bars with fct_reorder() from the forcats package.

library(forcats)

ggplot(dem_summary, aes(x = fct_reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

Note that we could also use the base R reorder() function here.

This way, we get a nice, visually appealing ordering of the bars according to levels of democracy…
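
To see what the reordering does behind the scenes, here is a small sketch with base R reorder() on made-up values. The vectors group and score are hypothetical. The minus sign sorts from highest to lowest.

```r
# Hypothetical groups and scores, made up for illustration only
group <- c("A", "B", "C")
score <- c(0.2, 0.8, 0.5)

# Without reordering, the levels are alphabetical
levels(factor(group))

# reorder() sorts the levels by the second argument, smallest first;
# negating the score therefore puts the highest score first
ordered_group <- reorder(group, -score)
levels(ordered_group)
```

ggplot2 draws a discrete axis in the order of the factor levels, which is why changing the levels changes the order of the bars.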

Now let’s change the theme to theme_minimal().

ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue") + 
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    ) + theme_minimal()

This gives us a clean, elegant look.

Note that you can also save your plot as an object to modify later.

dem_bar_chart <- ggplot(dem_summary, aes(x = reorder(region, -polyarchy), y = polyarchy)) +
  geom_col(fill = "steelblue")

Which gives us…

dem_bar_chart

Now let’s add back our labels…

dem_bar_chart <- dem_bar_chart +
  labs(
    x = "Region", 
    y = "Avg. Polyarchy Score", 
    title = "Democracy by region, 1990 - present", 
    caption = "Source: V-Dem Institute"
    )

So now we have…

dem_bar_chart

And now we’ll add back our theme…

dem_bar_chart <- dem_bar_chart + theme_minimal()

Voila!

dem_bar_chart

Change the theme. There are many themes to choose from.

dem_bar_chart + theme_bw()
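
Because the plot is stored as an object, trying several themes is cheap. A sketch with a hypothetical plot base_plot built from made-up values; the four themes shown all ship with ggplot2.

```r
library(ggplot2)

# Hypothetical scores, made up for illustration only
scores <- data.frame(
  group = c("A", "B", "C"),
  score = c(0.2, 0.5, 0.8)
)

base_plot <- ggplot(scores, aes(x = group, y = score)) +
  geom_col(fill = "steelblue")

# Each line adds a different built-in theme to the same stored plot
themed <- list(
  bw      = base_plot + theme_bw(),
  classic = base_plot + theme_classic(),
  light   = base_plot + theme_light(),
  void    = base_plot + theme_void()
)

themed$classic  # print one of the variants
```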

Your Turn!

glimpse() the data
Find a new variable to visualize
Make a bar chart with it
Change the color of the bars
Order the bars
Add labels
Add a theme
Try saving your plot as an object
Then change the labels and/or theme


Histograms

Purpose of Histograms

Histograms are used to visualize the distribution of a single variable
They are used for continuous variables (e.g., income, age, etc.)
A continuous variable is one that can take on any value within a range (e.g., 0.5, 1.2, 3.7, etc.)
A discrete variable is one that can only take on certain values (e.g., 1, 2, 3, etc.)
The x-axis shows the values of the variable of interest, grouped into bins
The y-axis shows the frequency, i.e. the number of observations that fall in each bin
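
The counting a histogram does can be sketched by hand in base R. The vector values and the bin edges are made up for illustration.

```r
# Hypothetical values, made up for illustration only
values <- c(12, 18, 25, 31, 33, 47, 52, 58, 59, 71)

# Bin edges every 20 units: (0,20], (20,40], (40,60], (60,80]
breaks <- seq(0, 80, by = 20)

# cut() assigns each value to a bin; table() counts values per bin
bin_counts <- table(cut(values, breaks = breaks))
bin_counts
```

Each count becomes the height of one bar: 2, 3, 4, and 1 observations in the four bins.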

Example

Histogram Code

# load data
dem_women <- read_csv("data/dem_women.csv")

# filter to 2022
dem_women_2022 <- dem_women |>
  filter(year == 2022) 

# create histogram
ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Change Number of Bins

Change the number of bins (bars) with either the bins or the binwidth argument (the default is 30 bins):

ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(bins = 50, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

At 50 bins…

At 100 bins…probably too many!

Using binwidth instead of bins…

ggplot(dem_women_2022, aes(x = flfp)) +
  geom_histogram(binwidth = 2, fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Number of Countries",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Setting binwidth to 2…
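
The two arguments describe the same choice from different directions. A rough arithmetic sketch on made-up values (the vector x is hypothetical); the exact bin edges ggplot2 picks also depend on where the first bin starts, so the actual number of bars may differ slightly.

```r
# Hypothetical values between 10 and 90, made up for illustration only
x <- c(10, 22, 35, 41, 48, 56, 63, 77, 90)

value_range <- diff(range(x))   # 90 - 10 = 80

# Choosing a bin width implies roughly this many bins
binwidth <- 2
approx_bins <- value_range / binwidth

# Choosing a number of bins implies roughly this bin width
bins <- 50
approx_width <- value_range / bins
```

Use binwidth when the width has a natural meaning (such as 2 percentage points), and bins when you only care about how fine the picture is.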

Change from Count to Density

ggplot(dem_women_2022, aes(x = flfp, y = after_stat(density))) +
  geom_histogram(fill = "steelblue") + 
  labs(
    x = "Percentage of Working Aged Women in Labor Force",
    y = "Density",
    title = "Female labor force participation rates, 2022",
    caption = "Source: World Bank"
    ) + theme_minimal()

Which gives us…
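
Density rescales the counts so that the areas of the bars add up to 1. A worked sketch in base R, using made-up values and a hypothetical bin width of 20:

```r
# Hypothetical values, made up for illustration only
values <- c(12, 18, 25, 31, 33, 47, 52, 58, 59, 71)

binwidth <- 20
breaks <- seq(0, 80, by = binwidth)

# Count the observations in each bin
counts <- as.vector(table(cut(values, breaks = breaks)))

# Density = count / (number of observations * bin width)
density <- counts / (length(values) * binwidth)
density

# Bar area = height * width; the areas sum to 1
sum(density * binwidth)
```

The shape of the histogram is unchanged; only the scale of the y-axis differs.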

Your Turn!

Pick a variable whose distribution you want to explore
Make a histogram
1. Only specify x = in aes()
2. Specify the geom as geom_histogram()
Choose color for bars
Choose appropriate labels
Change number of bins
Change from count to density
