Getting used to ggplot

Data Visualization, Week 2

Kieran Healy, Duke University

Outline for Today

  1. Housekeeping
  2. Basic Principles Again
  3. Introducing ggplot
  4. Summarizing a Variable

How to Navigate these Slides

  • When you view them online, notice the compass in the bottom right corner
  • You can go left or right, or sometimes down to more detail.
  • Hit the Escape key to get an overview of all the slides. On a phone or tablet, pinch to get the slide overview.
  • You can use the arrow keys (or swipe up and down) in this view, as well.
  • Hit Escape again to return to the slide you were looking at.
  • On a phone or tablet, tap the slide you want.

Reminder

  • There are two ways to learn R: the easy way and the tedious way.
  • The problem is that the easy way doesn't work.
  • You have to practice the examples and work through them manually. Type them out, even if you're just copying at the beginning. It really will help you get used to how the language works.

Reminder

  • You will benefit a lot from taking almost any R tutorial, whether from a textbook or online. The syllabus has links.
  • For example, Try R.

Principles Again

Perception

  • Visualizing data is not just a matter of good taste.
  • Basic perceptual processes play a very strong role.
  • These have consequences for how we will want to encode data when we visualize it, i.e., how and whether we choose to represent numbers or categories as shapes, colors, lengths, etc.

Perception

  • We more easily see edges, contrasts, and movement.
  • We judge relative differences rather than absolute values.
  • We tend to infer relationships between elements based on gestalt-like rules.

Hermann Grid Effect

Contrast Effects

Color Makes things More Complex

What stands out

  • (Miriah Meyer.)

Gestalt Rules

  • Bang Wong, Nature Methods 7 863 (2010)

  • Example: Picking out a data point

  • Highlight by shape

  • Highlight by color

  • Highlight by size

  • Highlight by all three

  • Multiple channels of comparison become uninterpretable very fast

  • Unless your data has a lot of structure

The data on the graph are the reason for the existence of the graph.

Cleveland (1994, 25)

Writing Plots

Go get the Gapminder Data

gapminder.url <- "https://raw.githubusercontent.com/socviz/soc880/master/data/gapminder.csv"
my.data <- read.csv(url(gapminder.url))
dim(my.data)
## [1] 1704    6
head(my.data)
##   country continent year lifeExp      pop gdpPercap
## 1 Algeria    Africa 1952  43.077  9279525  2449.008
## 2 Algeria    Africa 1957  45.685 10270856  3013.976
## 3 Algeria    Africa 1962  48.303 11000948  2550.817
## 4 Algeria    Africa 1967  51.407 12760499  3246.992
## 5 Algeria    Africa 1972  54.518 14760787  4182.664
## 6 Algeria    Africa 1977  58.014 17152804  4910.417
  • Remember what we said before about everything being an object, and every object having a class.
## We'll be a bit more verbose
## to make things clearer
p <- ggplot(data=my.data,
            aes(x=gdpPercap,
                y=lifeExp)) 
  • ggplot works by building your plot piece by piece
  • We start with a clean data frame called my.data
  • Then we tell ggplot what pieces of it we are interested in right now.
  • We create an object called p containing this information
  • Here, x=gdpPercap and y=lifeExp say what will go on the x and the y axes
  • These are aesthetic mappings that connect pieces of the data to things we can actually see on a plot.
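
To make the "everything is an object" point concrete, here is a minimal sketch that inspects a plot object. It assumes ggplot2 is installed. The names algeria and p_small are made up for this example; the six rows are the ones printed by head(my.data) above, typed in by hand so nothing has to be downloaded.

```r
library(ggplot2)

# The six rows shown by head(my.data), entered by hand (assumption: these
# match the file; the names `algeria` and `p_small` are illustrative only).
algeria <- data.frame(
  country   = "Algeria",
  continent = "Africa",
  year      = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  pop       = c(9279525, 10270856, 11000948, 12760499, 14760787, 17152804),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

p_small <- ggplot(data = algeria,
                  aes(x = gdpPercap, y = lifeExp))

class(p_small)           # the object's class vector; it includes "ggplot"
names(p_small$mapping)   # the aesthetics we mapped: "x" and "y"
```

Nothing has been drawn at this point. The object only records which data frame to use and which variables go with which aesthetics.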

About aesthetic mappings

  • The aes() function links variables to things you will see on the plot.
  • The x and y values are the most obvious ones.
  • Other aesthetic mappings include, e.g., color, shape, and size.
  • These mappings do not directly specify which particular colors or shapes will appear on the plot. Rather, they say which variables in the data will be represented by colors, shapes, and so on.
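
As one possible illustration of a mapping other than x and y, the sketch below maps pop to size. It assumes ggplot2 is installed and reuses the six Algeria rows printed earlier (the names algeria and p_size are made up for the example). We name the variable; ggplot decides the actual point sizes.

```r
library(ggplot2)

# Three columns from the head(my.data) output, typed in by hand.
algeria <- data.frame(
  pop       = c(9279525, 10270856, 11000948, 12760499, 14760787, 17152804),
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

# size = pop says "represent population by point size".
# It does not say how big any particular point should be.
p_size <- ggplot(algeria, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point()
p_size

# layer_data() shows the values ggplot computed for the first layer.
# The size column holds the point sizes it chose, one per row.
layer_data(p_size, 1)$size
```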

Adding layers to the plot

  • What happens when you type p at the console and hit return?
  • We need to add a layer to the plot.
  • This takes the p object we've created, and applies geom_point() to it, a function that knows how to take x and y values and plot them in a scatterplot.
p + geom_point()
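
A small sketch of what "adding a layer" does to the object, assuming ggplot2 is installed (algeria, p_small and p_pts are illustrative names; the rows come from the head(my.data) output). In recent versions of ggplot2, printing a plot with no layers should give an empty panel with axes, though older versions may behave differently.

```r
library(ggplot2)

algeria <- data.frame(
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

p_small <- ggplot(algeria, aes(x = gdpPercap, y = lifeExp))
length(p_small$layers)   # 0: the object has no geoms yet

p_pts <- p_small + geom_point()
length(p_pts$layers)     # 1: the point layer we just added

length(p_small$layers)   # still 0: `+` returned a new object
```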

The Plot-Making Process

0. Start with your data in the right shape

1. Tell ggplot what relationships you want to see

2. Tell ggplot how you want to see them

3. Layer these pictures as needed

4. Fine-tune scales, labels, tick marks, etc.
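
The steps above can be sketched in one short script. This is only an illustration, assuming ggplot2 is installed; it uses the six Algeria rows from the head(my.data) output, and the object names and label text are made up.

```r
library(ggplot2)

# 0. Data in the right shape: one row per observation, one column per variable.
algeria <- data.frame(
  year    = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014)
)

# 1. Say which relationship we want to see.
p_steps <- ggplot(algeria, aes(x = year, y = lifeExp))

# 2. and 3. Say how we want to see it, layering geoms as needed.
p_layers <- p_steps + geom_line() + geom_point()

# 4. Fine-tune scales and labels.
p_final <- p_layers +
  scale_x_continuous(breaks = algeria$year) +
  labs(x = "Year", y = "Life expectancy (years)")
p_final
```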

This layering process is literally additive

p <- ggplot(my.data,
            aes(x=gdpPercap, y=lifeExp))

p + geom_point()

p + geom_point() +
    geom_smooth(method="loess") 

  • Here we add a second geom. It's a loess smoother. There are others. Try lm, for example.
  • What happens when you put geom_smooth() first instead of second?
  • Notice how both geom_point() and geom_smooth() inherit the information in p about what the x and y variables are.
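
One way to explore the two suggestions above, sketched with the six Algeria rows from the head(my.data) output (assumes ggplot2 is installed; the object names are made up). It swaps in method="lm" and puts the smoother first, then checks the order in which the layers are stored. Layers are drawn in that order, so later ones sit on top.

```r
library(ggplot2)

algeria <- data.frame(
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

p_small <- ggplot(algeria, aes(x = gdpPercap, y = lifeExp))

# Smoother first, points second. Neither geom repeats the x and y
# mappings: both inherit them from p_small.
p_lm <- p_small +
  geom_smooth(method = "lm") +
  geom_point()
p_lm

# The stored layer order matches the order in which we added them.
layer_order <- sapply(p_lm$layers, function(layer) class(layer$geom)[1])
layer_order
```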
p + geom_point() +
    geom_smooth(method="loess") +
    scale_x_log10()

  • The next layer does not change anything in the underlying data. Instead it adjusts the x-axis scale.
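
A quick check of that claim, as a sketch under the same assumptions as before (ggplot2 installed, the six Algeria rows typed in by hand, illustrative object names). The data frame keeps its original units; the log transformation only shows up in the values ggplot computes for drawing.

```r
library(ggplot2)

algeria <- data.frame(
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

p_log <- ggplot(algeria, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
p_log

# The data stored in the plot are untouched by the scale.
identical(p_log$data$gdpPercap, algeria$gdpPercap)

# The positions used for drawing are on the log10 scale.
drawn_x <- layer_data(p_log, 1)$x
all.equal(as.numeric(drawn_x), log10(algeria$gdpPercap))
```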
p + geom_point(color="firebrick") +
    geom_smooth(method="loess") +
    scale_x_log10()

  • Here, notice we changed the color of the points by specifying the color argument in geom_point(). This is called setting an aesthetic feature.
  • Setting an aesthetic has no relationship to the data. In the previous plot, the color red does not represent any feature of the data.
  • To see the difference between setting and mapping an aesthetic, let's go back to our p object and recreate it.
  • This time, in addition to x and y, we tell ggplot to map the variable continent to the color aesthetic.
p <- ggplot(my.data,
            aes(x=gdpPercap,
                y=lifeExp,
                color=continent))
  • Now there is a relationship or mapping between the data and the aesthetic.
  • The values of the variable continent will be represented by colors on the figure we draw.
p + geom_point() +
    scale_x_log10()

  • Like this. We do not manually specify any colors. We told ggplot() to map the values of continent to the property, or aesthetic, of color.
  • Try mapping continent to the aesthetic shape.
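
The difference between setting and mapping can also be seen by putting the same color name in two places. This is a sketch, assuming ggplot2 is installed and reusing the six Algeria rows (object names are made up). Outside aes(), "firebrick" is a color. Inside aes(), it is treated as a one-value variable and ggplot picks a color for it from its default palette.

```r
library(ggplot2)

algeria <- data.frame(
  lifeExp   = c(43.077, 45.685, 48.303, 51.407, 54.518, 58.014),
  gdpPercap = c(2449.008, 3013.976, 2550.817, 3246.992, 4182.664, 4910.417)
)

p_small <- ggplot(algeria, aes(x = gdpPercap, y = lifeExp))

# Setting: the points are drawn in the color we asked for.
p_set <- p_small + geom_point(color = "firebrick")

# Mapping (probably not what was intended): "firebrick" becomes a
# category label, and a legend appears for it.
p_mapped <- p_small + geom_point(aes(color = "firebrick"))

set_colors    <- unique(layer_data(p_set, 1)$colour)
mapped_colors <- unique(layer_data(p_mapped, 1)$colour)
set_colors      # "firebrick"
mapped_colors   # a single palette color chosen by ggplot
```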

Colorless green ideas sleep furiously

  • ggplot implements a "grammar" of graphics, an idea developed by Leland Wilkinson (2005).
  • The grammar gives you rules for how to map pieces of data to geometric objects (like points and lines) with attributes (like color and size), together with further rules for transforming the data if needed, adjusting scales, or projecting the results onto a coordinate system.
  • A key point is that, like other rules of syntax, it limits what you can say but doesn't make what you say sensible or meaningful.
  • It allows you to produce "sentences" (mappings of data to objects) but they can easily be garbled.

More work needed (1)

p + geom_line()

More work needed (2)

p + geom_bar(stat="identity")

Once you get used to it, this layered grammar lets you build up sophisticated plots

## "Not in" convenience operator
"%nin%" <- function(x, y) {
  return( !(x %in% y) )
}

p <- ggplot(subset(my.data, country %nin% "Kuwait"), aes(x=year, y=gdpPercap))

p1 <- p + geom_line(color="gray70", aes(group=country)) +
    geom_smooth(size=1.1, method="loess", se=FALSE)

p1 + facet_wrap(~ continent) + labs(x="Year", y="GDP per capita")

  • To see the logic of each step of a plot, peel the layers backwards from the last one to the first, and see which parts of the plot are changed, or disappear.
  • Also examine what happens if you change some of the arguments, e.g. se=TRUE, or method='lm', or what happens when you leave them at their defaults.
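
One way to peel the layers is to save each stage as its own object and print them one at a time. The sketch below assumes ggplot2 is installed and uses a tiny made-up data frame (the values, country names, and continent names are invented for illustration and are not gapminder figures). It uses method="lm" rather than loess only because there are so few points.

```r
library(ggplot2)

# Made-up illustrative data with the same column names as my.data.
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D"), each = 3),
  continent = rep(c("North", "South"), each = 6),
  year      = rep(c(1990, 2000, 2010), times = 4),
  gdpPercap = c(1, 2, 3,  2, 3, 5,  4, 4, 6,  3, 5, 8) * 1000
)

base  <- ggplot(toy, aes(x = year, y = gdpPercap))
step1 <- base  + geom_line(color = "gray70", aes(group = country))
step2 <- step1 + geom_smooth(method = "lm", se = FALSE)
step3 <- step2 + facet_wrap(~ continent)
step4 <- step3 + labs(x = "Year", y = "GDP per capita")

# Print the stages from last to first and watch what disappears.
step4
step3
step2
step1
```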