Visual Inference: Using Sesame Street Logic to Introduce Key Statistical Ideas

As outlined by Cobb (2007), most introductory statistics books teach classical hypothesis tests as

  1. formulating null and alternative hypotheses, 
  2. calculating a test statistic from the observed data, 
  3. comparing the test statistic to a reference (null) distribution, and 
  4. deriving a p-value on which a conclusion is based.

This is still true for the first course, even after the 2016 GAISE guidelines were adapted to include normal- and simulation-based methods. Further, most textbooks attempt to carefully talk through the logic of hypothesis testing, perhaps showing a static example of hypothetical samples that go into the reference distribution. Applets, such as StatKey and the Rossman Chance ISI applets, take this a step further, allowing students to gradually create these simulated reference distributions in an effort to build student intuition and understanding. While these are fantastic tools, I have found that many students still struggle to understand what the purpose of a reference distribution is and the overarching logic of testing. To remedy this, I have been using visual inference to introduce statistical testing, where “plots take on the role of test statistics, and human cognition the role of statistical tests” (Buja et al., 2009). In this process, I continually encourage students to apply Sesame Street logic: which one of these is not like the other? By using this alternative approach that focuses on visual displays over numerical summaries, I have been pleased with the improvement in student understanding, so I thought I would share the idea with the community.

Visual inference via the lineup protocol

In visual inference, the lineup protocol (named after “police lineup” for criminal investigations) provides a direct analog for each step of a hypothesis test. 

  1. Competing claims: Similar to a traditional hypothesis test, a visual test begins by clearly stating the competing claims about the model/population parameters. 
  2. Test statistic: A plot displaying the raw data or fitted model (we’ll call it the observed plot) serves as the “test statistic” under the visual inference framework. This plot must be chosen to highlight features of the data that are relevant to the hypotheses in mind. For example, a scatterplot is a natural choice to examine whether or not there is a correlation between two quantitative variables, but will be less useful in the examination of association between a categorical and a quantitative variable. In that situation, side-by-side boxplots or overlaid density plots are more useful.
  3. Reference (null) distribution: Null plots are generated consistently with the null hypothesis and the set of all null plots constitutes the reference (or null) distribution. To facilitate comparison of the observed plot to the null plots, the observed plot is randomly situated in the field of null plots, just like a suspect is randomly situated amongst decoys in a police lineup. This arrangement of plots is called a lineup.
  4. Assessing evidence: If the null hypothesis is true, then we expect the observed plot to be indistinguishable from the null plots. If you (the observer) are able to identify the observed plot in the above lineup, then this provides evidence against the null hypothesis. If one wishes to calculate a visual p-value, then lineups need to be presented to a number of independent observers for evaluation. While this is possible, it is not a productive discussion in most intro stats classes that don’t do a deep dive into probability theory.    


As a first example in class, I use the creative writing experiment discussed in The Statistical Sleuth. The experiment was designed to explore whether creativity scores were impacted by the type of motivation (intrinsic or extrinsic). To evaluate this, creative writers were randomly assigned to a questionnaire where they ranked reasons they write: one questionnaire listed intrinsic motivations and the other listed extrinsic motivations. After completing the questionnaire, all subjects wrote a Haiku about laughter, which was graded for creativity by a panel of poets. Below, I will give a brief overview of each part of the visual lineup activity.

Competing claims

First, have my students discuss what competing claims are being investigated. I encourage them to write these in words before linking them with the mathematical notation they saw in the reading prior to class. The most common answer is: there is no difference in the average creative writing scores for the two groups vs. there is a difference in the average creative writing scores for the two groups. During the debrief, I make sure to link this to notation:


H_{A}:\mu_{intrinsic} - \mu_{extrinsic}\neq 0

EDA review

Next, I have students discuss what plot types would be most useful to investigate this claim, reinforcing topics from EDA.

Lineup evaluation

Most students recognize that side-by-side boxplots, faceted histograms, or density plots are reasonable choices to display the relevant aspects of the distribution of creative writing scores for each group. I then give them a lineup of side-by-side boxplots to evaluate (note I place a dot at the sample mean for each group), such as the one shown below. Here, the null plots are generated by permuting the treatment labels; thus, breaking any association present between the treatment and creativity scores. (I don’t give the students these details yet, I just tell them that one plot is the observed data while the other 19 agree with the null hypothesis.) I ask the students to

  1. choose which plot is the most different from the others, and
  2. explain why they chose that plot.

[Your turn! Try it out yourself! Which one of these is not like the other?]

Lineup discussion

Once all of the groups have evaluated their lineups and discussed their reasoning, we regroup for a class discussion. During this discussion, I reveal that the real data are shown in plot 10, and display these data on a slide so that we can point to particular features of the plot as necessary. After revealing the real data, I have students return to their groups to discuss whether they chose the real data, and whether their choices support either of the competing claims. Once the class regroups and thoughts are shared, I make sure that the class realizes that an identification of the raw data provides evidence against the null hypothesis (though I always hope students will be the ones saying this!).

Biased observers

When I first started this activity, I showed the students the real data prior to the lineup, which made them biased observers. Consequently, students had an easier time choosing the real data, and the initial discussion within their groups wasn’t as rich. However, I have seen little impact on the follow-up discussion focusing on whether the data are identifiably different and what that implies. 

Benefits of the lineup protocol

The strong parallels between visual inference and classical hypothesis testing make it a natural way to introduce the idea of statistical significance without getting bogged down in the minutiae/controversy of p-values, or the technical issues of describing a simulation procedure before students understand why that’s important. All of my students understand the question “which one of these is not like the others,” and this common understanding has generated fruitful discussion about the underlying inferential thought process without the need for a slew of definitions. In addition, after this activity I find it easier to discuss how we generate permutation resamples and conduct permutation tests, because students have seen permutations in lineups and have already thought about evaluating evidence.

Where would this fit into your course?

As you’ve seen, I use the lineup protocol in my intro stats course to introduce the logic behind hypothesis tests. 

In addition, I use visual inference to help students build intuition about new and unfamiliar plot types, such as Q-Q plots, mosaic plots, and residual plots. For example, when I introduce students to mosaic plots using the flying data set in the fivethirtyeight R package, I pick one pair of categorical variables, such as one’s opinion on whether it’s rude to bring a baby on a plane and their self-identified gender. Then, I have students create the mosaic plot and discuss what they see. Once they have recorded their thoughts, I provide a lineup consisting only of null plots (i.e., no association) and have them compare their observed plot to the null plots, discussing what this tells them about potential association.

How to create lineups for your classes

The nullabor R package makes creating lineups reasonably painless if you understand ggplot2 graphics. I’ve created a nullabor tutorial to help you create lineups for your classes, and am almost done with shiny apps to implement lineups in a variety of settings.

How Do We Encourage “Productive Struggle” in Large Classes?

Contributing author Catherine Case is a lecturer at the University of Georgia and the lesson plan editor for Statistics Teacher.

This post is really inspired by a plenary talk given by Jim Stigler at USCOTS 2015. He’s a psychologist at UCLA, and in his USCOTS talk, he emphasized the idea of productive struggle. He talked about different teaching cultures around the world, and how American classrooms often feature “quick and snappy” lessons as opposed to “slow and sticky” lessons, despite the fact that making the process of learning harder can actually lead to deeper, longer-lasting understanding.

His ideas really challenged me, because I often teach fairly large classes (120 – 140 students per section), and nowhere is “quick and snappy” more highly valued than in a large lecture. There’s definitely tension in large classes between efficiency and productive struggle. 

EfficiencyProductive Struggle
Statistical questions are clearly defined in the textbook.Students carry out the full problem-solving process.
Teacher solves all problems (correctly and on the first try).Students wrestle with concepts before strategies are directly taught.
Students use formulas and probability tables proficiently.Students use appropriate data analysis tools.

At first, this tension was overwhelming to me. In the stat ed community, we’re surrounded with inspiring, innovative ideas, but the gap between where we are and where we want to be can be paralyzing. To counter that, let’s start small with a simple classroom activity that allows students to struggle through the statistical process. Along the way, I’ll mention tricks that make it easier to pull off, even with lots of students in the room.

Example: A Survey of the Class

Formulate Questions

This activity is great for the beginning of the semester, because it only requires knowledge of a few statistical terms – statistical vs. survey questions, explanatory vs. response variables, categorical vs. quantitative variables. It also challenges students’ expectations about what’s required of them in a large lecture class, because right off the bat, they’re being asked to collaborate and communicate their statistical ideas.  

  • First, students work in groups to write a statistical question about the relationship between two variables that can be answered based on a class survey. Then they pass their card to another group.
  • After receiving another group’s card, students break down the statistical question into variables. Which is the explanatory variable and which is the response? Are these variables categorical or quantitative? Then they pass their card to another group.
  • Students write appropriate survey questions that could be used to collect data – one survey question per variable. 

I’ll admit that in many of my lessons, I have a well-defined statistical question in mind before class even starts. This activity is different, because students experience the messy process of formulating a statistical question and operationalizing it for a survey. 

Collect Data

Before the next class period, I read their work (or at least a “random sample” of their work ☺) and I try to close the feedback loop by discussing common issues that I noticed. Do some questions go beyond the scope of a class survey? Are certain kinds of variables commonly misclassified? How can we improve ambiguous survey questions? Even though my class is too large to talk to every student individually, this gives me an opportunity to respond to and challenge student thinking. 

Later we can use student-written questions as the starting point for data collection and analysis. I usually choose 10-15 survey questions (ideally relevant to more than one statistical question), and collect their data via Google Forms. When students answer open-ended questions like, “How many hours do you spend studying in a typical week,” it generates data that’s messy but manageable. It feels more authentic than squeaky clean textbook data, plus the struggle of cleaning a few hundred observations by hand may help students understand the need for better data cleaning methods.

Analyzing Data Using Appropriate Tools

“Appropriate tools” certainly aren’t one-size fits all, but for this activity, I need a tool that…

  • Can handle large(ish) datasets 
  • Is accessible for students – preferably free!
  • Makes it easy to construct graphs and calculate summary statistics

At UGA, we have a site license that makes JMP free for students, and many regularly bring their laptops to class, so JMP works well for us with students working in pairs. If I didn’t have access to JMP, I might consider CODAP, which looks a lot like Fathom (friendly drag and drop interface!) except it’s free and runs in a web browser. 

Speaking of a friendly interface, another hurdle in a large class is how to trouble-shoot technology for students, especially if you don’t have smaller “lab” sections or TA support during class. For me, it’s a delicate balance of scaffolding and classroom culture…

After demonstrating how to construct graphs and calculate summaries using software, I assign some straightforward data analysis questions with right/wrong answers. For this, I use an app called Socrative, which works similarly to clickers, except that it allows for both multiple choice and free response questions. Socrative allows me to give immediate feedback – for example, if they miss a question, I can provide them with the software instructions they need. In addition to feedback through Socrative, I try to normalize the process of struggling with new technology and encourage them to help each other. I remind them it’s impossible for me to help everyone individually, but I’m confident they can work together and solve most problems without me. Students generally rise to the challenge and accept that there are multiple sources of knowledge in the room.  

Once I’m confident students know how to use the necessary data analysis tools, we can try more challenging, open-ended questions. For example, I may choose a response variable and ask students to explore the data until they find a variable that’s a good predictor, then write a few sentences about that relationship. They need to use graphs and calculate statistics to answer this, but I’m not explicitly telling them which graphs and statistics to use, and I’m certainly not giving them “point here, click here” style instructions. There’s a little productive struggle involved!

Interpret Results in Context

In the following class, I present student analyses as a starting point for our interpretations. They already have a foundation for discussing effect sizes and strength of evidence, because they’ve considered the relationships among variables themselves. Students can offer deep insights about the limitations of the analysis (e.g., sampling issues, measurement issues, correlation vs. causation), because they’ve been involved with the investigation at every stage. 

Look Back and Ahead

The authors of the ISI curriculum (Tintle, et al.) include “look back and ahead” as the final step of the statistical process. At this step, students consider limitations of the study and propose future work.

This concept is really helpful in my teaching too. Earlier I mentioned students’ expectations, but I’m also working on managing my own expectations. I can’t let the idea of a perfect active learning class keep me from taking steps in the right direction. I don’t have to change everything in one semester and I can’t expect every activity I try will work. The best I can do is to make a few small changes right now, keep a journal to learn from my experiences, and keep moving forward.