Visual Inference: Using Sesame Street Logic to Introduce Key Statistical Ideas

As outlined by Cobb (2007), most introductory statistics books teach classical hypothesis tests as

  1. formulating null and alternative hypotheses, 
  2. calculating a test statistic from the observed data, 
  3. comparing the test statistic to a reference (null) distribution, and 
  4. deriving a p-value on which a conclusion is based.

This is still true for the first course, even after the 2016 GAISE guidelines were adapted to include normal- and simulation-based methods. Further, most textbooks attempt to carefully talk through the logic of hypothesis testing, perhaps showing a static example of hypothetical samples that go into the reference distribution. Applets, such as StatKey and the Rossman Chance ISI applets, take this a step further, allowing students to gradually create these simulated reference distributions in an effort to build student intuition and understanding. While these are fantastic tools, I have found that many students still struggle to understand what the purpose of a reference distribution is and the overarching logic of testing. To remedy this, I have been using visual inference to introduce statistical testing, where “plots take on the role of test statistics, and human cognition the role of statistical tests” (Buja et al., 2009). In this process, I continually encourage students to apply Sesame Street logic: which one of these is not like the other? By using this alternative approach that focuses on visual displays over numerical summaries, I have been pleased with the improvement in student understanding, so I thought I would share the idea with the community.

Visual inference via the lineup protocol

In visual inference, the lineup protocol (named after “police lineup” for criminal investigations) provides a direct analog for each step of a hypothesis test. 

  1. Competing claims: Similar to a traditional hypothesis test, a visual test begins by clearly stating the competing claims about the model/population parameters. 
  2. Test statistic: A plot displaying the raw data or fitted model (we’ll call it the observed plot) serves as the “test statistic” under the visual inference framework. This plot must be chosen to highlight features of the data that are relevant to the hypotheses in mind. For example, a scatterplot is a natural choice to examine whether or not there is a correlation between two quantitative variables, but will be less useful in the examination of association between a categorical and a quantitative variable. In that situation, side-by-side boxplots or overlaid density plots are more useful.
  3. Reference (null) distribution: Null plots are generated consistently with the null hypothesis and the set of all null plots constitutes the reference (or null) distribution. To facilitate comparison of the observed plot to the null plots, the observed plot is randomly situated in the field of null plots, just like a suspect is randomly situated amongst decoys in a police lineup. This arrangement of plots is called a lineup.
  4. Assessing evidence: If the null hypothesis is true, then we expect the observed plot to be indistinguishable from the null plots. If you (the observer) are able to identify the observed plot in the lineup, then this provides evidence against the null hypothesis. If one wishes to calculate a visual p-value, then lineups need to be presented to a number of independent observers for evaluation. While this is possible, it is not a productive discussion in most intro stats classes that don’t do a deep dive into probability theory.
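
For instructors who do want to see how a visual p-value could be computed, here is a minimal sketch. It assumes a simple binomial model that is not spelled out in this post: each of K independent observers views a lineup of m plots and, if the null hypothesis is true, picks the observed plot with probability 1/m. The function name `visual_p_value` and its arguments are illustrative only.

```python
from math import comb

def visual_p_value(n_observers, n_correct, n_plots=20):
    """P(X >= n_correct) for X ~ Binomial(n_observers, 1 / n_plots).

    Assumption: under the null hypothesis every observer independently
    picks the observed plot with probability 1 / n_plots.
    """
    p = 1.0 / n_plots
    return sum(
        comb(n_observers, k) * p**k * (1 - p) ** (n_observers - k)
        for k in range(n_correct, n_observers + 1)
    )

# A single observer who picks the observed plot out of 20
print(visual_p_value(n_observers=1, n_correct=1))  # 0.05

# Ten observers, four of whom pick the observed plot
print(visual_p_value(n_observers=10, n_correct=4))
```

With a single observer and 20 plots, a correct identification corresponds to a chance probability of 1/20, which is one reason 20-panel lineups are convenient for classroom use.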


As a first example in class, I use the creative writing experiment discussed in The Statistical Sleuth. The experiment was designed to explore whether creativity scores were impacted by the type of motivation (intrinsic or extrinsic). To evaluate this, creative writers were randomly assigned to a questionnaire where they ranked reasons they write: one questionnaire listed intrinsic motivations and the other listed extrinsic motivations. After completing the questionnaire, all subjects wrote a Haiku about laughter, which was graded for creativity by a panel of poets. Below, I will give a brief overview of each part of the visual lineup activity.

Competing claims

First, I have my students discuss what competing claims are being investigated. I encourage them to write these in words before linking them with the mathematical notation they saw in the reading prior to class. The most common answer is: there is no difference in the average creative writing scores for the two groups vs. there is a difference in the average creative writing scores for the two groups. During the debrief, I make sure to link this to notation:


H_{0}:\mu_{intrinsic} - \mu_{extrinsic} = 0

H_{A}:\mu_{intrinsic} - \mu_{extrinsic}\neq 0

EDA review

Next, I have students discuss what plot types would be most useful to investigate this claim, reinforcing topics from EDA.

Lineup evaluation

Most students recognize that side-by-side boxplots, faceted histograms, or density plots are reasonable choices to display the relevant aspects of the distribution of creative writing scores for each group. I then give them a lineup of side-by-side boxplots to evaluate (note that I place a dot at the sample mean for each group), such as the one shown below. Here, the null plots are generated by permuting the treatment labels, thus breaking any association present between the treatment and creativity scores. (I don’t give the students these details yet; I just tell them that one plot is the observed data while the other 19 agree with the null hypothesis.) I ask the students to

  1. choose which plot is the most different from the others, and
  2. explain why they chose that plot.

[Your turn! Try it out yourself! Which one of these is not like the other?]
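
For readers curious about the mechanics, the label permutation behind the null plots can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used to build the lineup in this post. The function name `make_lineup_labels` and the scores are made up for the example and are not the creative writing data.

```python
import random

def make_lineup_labels(labels, n_plots=20, seed=None):
    """Build the label sets for a lineup of n_plots panels.

    Returns (panels, observed_panel): panels is a list of label lists and
    observed_panel is the 1-based position holding the unpermuted labels.
    """
    rng = random.Random(seed)
    panels = []
    for _ in range(n_plots - 1):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # permuting labels breaks any association
        panels.append(shuffled)
    observed_panel = rng.randrange(n_plots) + 1  # random spot for the real data
    panels.insert(observed_panel - 1, list(labels))
    return panels, observed_panel

# Made-up scores and treatment labels, for illustration only
scores = [12.0, 17.5, 19.1, 22.3, 24.0, 10.2, 14.8, 15.5, 18.9, 20.7]
labels = ["intrinsic"] * 5 + ["extrinsic"] * 5

panels, observed_panel = make_lineup_labels(labels, n_plots=20, seed=2024)

# Every panel pairs the same scores with that panel's labels
for panel_number, panel_labels in enumerate(panels[:2], start=1):
    print(panel_number, list(zip(panel_labels, scores))[:3])
```

Each panel would then be drawn as side-by-side boxplots of `scores` grouped by that panel's labels. Keeping `observed_panel` hidden until the class discussion preserves the unbiased-observer setup.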

Lineup discussion

Once all of the groups have evaluated their lineups and discussed their reasoning, we regroup for a class discussion. During this discussion, I reveal that the real data are shown in plot 10, and display these data on a slide so that we can point to particular features of the plot as necessary. After revealing the real data, I have students return to their groups to discuss whether they chose the real data, and whether their choices support either of the competing claims. Once the class regroups and thoughts are shared, I make sure that the class realizes that an identification of the raw data provides evidence against the null hypothesis (though I always hope students will be the ones saying this!).

Biased observers

When I first started this activity, I showed the students the real data prior to the lineup, which made them biased observers. Consequently, students had an easier time choosing the real data, and the initial discussion within their groups wasn’t as rich. However, I have seen little impact on the follow-up discussion focusing on whether the data are identifiably different and what that implies. 

Benefits of the lineup protocol

The strong parallels between visual inference and classical hypothesis testing make it a natural way to introduce the idea of statistical significance without getting bogged down in the minutiae/controversy of p-values, or the technical issues of describing a simulation procedure before students understand why that’s important. All of my students understand the question “which one of these is not like the others,” and this common understanding has generated fruitful discussion about the underlying inferential thought process without the need for a slew of definitions. In addition, after this activity I find it easier to discuss how we generate permutation resamples and conduct permutation tests, because students have seen permutations in lineups and have already thought about evaluating evidence.
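
When the class moves from lineups to permutation tests, the same shuffling idea carries over directly. Below is a minimal sketch of a two-sided permutation test for a difference in means. The function name `permutation_p_value`, the number of resamples, and the data are assumptions for illustration, and adding 1 to the numerator and denominator is one common convention rather than the only option.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_resamples=2000, seed=None):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign group membership at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Count the observed arrangement as one of the possible permutations
    return (extreme + 1) / (n_resamples + 1)

# Made-up scores, for illustration only
intrinsic = [12.0, 17.5, 19.1, 22.3, 24.0, 26.7]
extrinsic = [10.2, 14.8, 15.5, 18.9, 20.7, 11.4]
print(permutation_p_value(intrinsic, extrinsic, seed=1))
```

Each shuffle plays the role of one null plot in the lineup. The p-value is the share of shuffles whose difference in means is at least as extreme as the observed one.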

Where would this fit into your course?

As you’ve seen, I use the lineup protocol in my intro stats course to introduce the logic behind hypothesis tests. 

In addition, I use visual inference to help students build intuition about new and unfamiliar plot types, such as Q-Q plots, mosaic plots, and residual plots. For example, when I introduce students to mosaic plots using the flying data set in the fivethirtyeight R package, I pick one pair of categorical variables, such as one’s opinion on whether it’s rude to bring a baby on a plane and their self-identified gender. Then, I have students create the mosaic plot and discuss what they see. Once they have recorded their thoughts, I provide a lineup consisting only of null plots (i.e., no association) and have them compare their observed plot to the null plots, discussing what this tells them about potential association.

How to create lineups for your classes

The nullabor R package makes creating lineups reasonably painless if you understand ggplot2 graphics. I’ve created a nullabor tutorial to help you create lineups for your classes, and am almost done with shiny apps to implement lineups in a variety of settings.

How Do We Encourage “Productive Struggle” in Large Classes?

Contributing author Catherine Case is a lecturer at the University of Georgia and the lesson plan editor for Statistics Teacher.

This post is really inspired by a plenary talk given by Jim Stigler at USCOTS 2015. He’s a psychologist at UCLA, and in his USCOTS talk, he emphasized the idea of productive struggle. He talked about different teaching cultures around the world, and how American classrooms often feature “quick and snappy” lessons as opposed to “slow and sticky” lessons, despite the fact that making the process of learning harder can actually lead to deeper, longer-lasting understanding.

His ideas really challenged me, because I often teach fairly large classes (120 – 140 students per section), and nowhere is “quick and snappy” more highly valued than in a large lecture. There’s definitely tension in large classes between efficiency and productive struggle. 

Efficiency | Productive Struggle
Statistical questions are clearly defined in the textbook. | Students carry out the full problem-solving process.
Teacher solves all problems (correctly and on the first try). | Students wrestle with concepts before strategies are directly taught.
Students use formulas and probability tables proficiently. | Students use appropriate data analysis tools.

At first, this tension was overwhelming to me. In the stat ed community, we’re surrounded with inspiring, innovative ideas, but the gap between where we are and where we want to be can be paralyzing. To counter that, let’s start small with a simple classroom activity that allows students to struggle through the statistical process. Along the way, I’ll mention tricks that make it easier to pull off, even with lots of students in the room.

Example: A Survey of the Class

Formulate Questions

This activity is great for the beginning of the semester, because it only requires knowledge of a few statistical terms – statistical vs. survey questions, explanatory vs. response variables, categorical vs. quantitative variables. It also challenges students’ expectations about what’s required of them in a large lecture class, because right off the bat, they’re being asked to collaborate and communicate their statistical ideas.  

  • First, students work in groups to write a statistical question about the relationship between two variables that can be answered based on a class survey. Then they pass their card to another group.
  • After receiving another group’s card, students break down the statistical question into variables. Which is the explanatory variable and which is the response? Are these variables categorical or quantitative? Then they pass their card to another group.
  • Students write appropriate survey questions that could be used to collect data – one survey question per variable. 

I’ll admit that in many of my lessons, I have a well-defined statistical question in mind before class even starts. This activity is different, because students experience the messy process of formulating a statistical question and operationalizing it for a survey. 

Collect Data

Before the next class period, I read their work (or at least a “random sample” of their work ☺) and I try to close the feedback loop by discussing common issues that I noticed. Do some questions go beyond the scope of a class survey? Are certain kinds of variables commonly misclassified? How can we improve ambiguous survey questions? Even though my class is too large to talk to every student individually, this gives me an opportunity to respond to and challenge student thinking. 

Later we can use student-written questions as the starting point for data collection and analysis. I usually choose 10-15 survey questions (ideally relevant to more than one statistical question), and collect their data via Google Forms. When students answer open-ended questions like “How many hours do you spend studying in a typical week?”, it generates data that’s messy but manageable. It feels more authentic than squeaky clean textbook data, plus the struggle of cleaning a few hundred observations by hand may help students understand the need for better data cleaning methods.
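
As a follow-up to cleaning by hand, an instructor might show what a scripted cleaning rule looks like. The sketch below is one hypothetical approach, not part of the activity as described. The function name `parse_hours`, the example responses, and the rule of replacing a range with its midpoint are all assumptions that a class could debate.

```python
import re

def parse_hours(response):
    """Turn a free-text answer into hours as a float, or None if no number is found."""
    text = response.strip().lower()
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None  # e.g. "a lot" needs a human decision
    # Treat "3-4" or "10 to 12" as a range and use its midpoint (a judgment call)
    if len(numbers) == 2 and re.search(r"\d\s*(?:-|–|to)\s*\d", text):
        return sum(numbers) / 2
    return numbers[0]

# Made-up responses, for illustration only
responses = ["10", "about 5 hrs", "3-4", "10 to 12 hours", "2.5", "a lot"]
for response in responses:
    print(repr(response), "->", parse_hours(response))
```

Every rule in a function like this encodes a decision that students previously made by eye, which makes it a natural prompt for discussing measurement issues.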

Analyze Data Using Appropriate Tools

“Appropriate tools” certainly aren’t one-size-fits-all, but for this activity, I need a tool that…

  • Can handle large(ish) datasets 
  • Is accessible for students – preferably free!
  • Makes it easy to construct graphs and calculate summary statistics

At UGA, we have a site license that makes JMP free for students, and many regularly bring their laptops to class, so JMP works well for us with students working in pairs. If I didn’t have access to JMP, I might consider CODAP, which looks a lot like Fathom (friendly drag and drop interface!) except it’s free and runs in a web browser. 

Speaking of a friendly interface, another hurdle in a large class is how to troubleshoot technology for students, especially if you don’t have smaller “lab” sections or TA support during class. For me, it’s a delicate balance of scaffolding and classroom culture…

After demonstrating how to construct graphs and calculate summaries using software, I assign some straightforward data analysis questions with right/wrong answers. For this, I use an app called Socrative, which works similarly to clickers, except that it allows for both multiple choice and free response questions. Socrative allows me to give immediate feedback – for example, if they miss a question, I can provide them with the software instructions they need. In addition to feedback through Socrative, I try to normalize the process of struggling with new technology and encourage them to help each other. I remind them it’s impossible for me to help everyone individually, but I’m confident they can work together and solve most problems without me. Students generally rise to the challenge and accept that there are multiple sources of knowledge in the room.  

Once I’m confident students know how to use the necessary data analysis tools, we can try more challenging, open-ended questions. For example, I may choose a response variable and ask students to explore the data until they find a variable that’s a good predictor, then write a few sentences about that relationship. They need to use graphs and calculate statistics to answer this, but I’m not explicitly telling them which graphs and statistics to use, and I’m certainly not giving them “point here, click here” style instructions. There’s a little productive struggle involved!

Interpret Results in Context

In the following class, I present student analyses as a starting point for our interpretations. They already have a foundation for discussing effect sizes and strength of evidence, because they’ve considered the relationships among variables themselves. Students can offer deep insights about the limitations of the analysis (e.g., sampling issues, measurement issues, correlation vs. causation), because they’ve been involved with the investigation at every stage. 

Look Back and Ahead

The authors of the ISI curriculum (Tintle et al.) include “look back and ahead” as the final step of the statistical process. At this step, students consider limitations of the study and propose future work.

This concept is really helpful in my teaching too. Earlier I mentioned students’ expectations, but I’m also working on managing my own expectations. I can’t let the idea of a perfect active learning class keep me from taking steps in the right direction. I don’t have to change everything in one semester and I can’t expect every activity I try will work. The best I can do is to make a few small changes right now, keep a journal to learn from my experiences, and keep moving forward. 

Get the p outta here! Discussing the USCOTS 2019 significance sessions

The theme of this year’s United States Conference on Teaching Statistics (USCOTS 2019), “Evaluating Evidence,” put an emphasis on the current discussion/debate on p-values and the use of the word “significant” when making statistical conclusions. Conference-wide presentations (1, 2, 3) offered updates to the ASA official statements on p-values based on the special issue of The American Statistician and potential ways to move beyond significance.

Now that USCOTS is four months behind us, we thought it would be a good idea to reflect on how it has impacted each of our teaching plans for this year. Each member of our editing team has shared their thoughts below. What are yours? [Share your plans in the comments section.]

If you are interested in starting a discussion about “statistical significance” in your own classroom, check out this cooperative learning activity that Laura Ziegler used in her department. 

Impact on Steve’s teaching: 

I left USCOTS feeling cautiously optimistic about the p-value/significance debate. On one hand, I was starting to feel like the discussion was spinning its wheels, focusing only on defining the problem and not on coming up with solutions. On the other hand, I learned about new ideas from numerous presenters that focused not only on alternative ways of evaluating evidence, but also on how to teach these new methods in the classroom. Despite my mixed feelings, the front-running solution in my mind is covered in The American Statistician editorial: results-blind science publishing (Locascio, 2017; 2019). Not only does results-blind publishing mean that studies will be evaluated on the importance of their research questions and appropriateness of their study designs, but it will simultaneously remove limitations inherent to p-values and other similar measures that result in intentional (or unintentional) misuses of statistics. I think journals that implement this review strategy will be making a big step in the right direction.

In the classroom this semester, I want to actively reduce the emphasis on p-values and statistical significance to make sure my students are grasping the bigger picture in statistical problem-solving. I think instructors of statistics tend to overemphasize the results of models, which causes students to make quick, straightforward conclusions using p-values. In an attempt to remedy this, I will be making a more conscious effort to prioritize the overarching statistical process during class and on homework feedback.  

Impact on Laura Le’s teaching:

After USCOTS, I brought the conversation back to my teaching team in Biostatistics. There were a few courses that were being revised, so it was a perfect time to discuss what to do about the phrase “statistical significance.” We decided to continue using it in the introductory biostatistics courses, because it is a phrase our students will frequently encounter in the medical and public health literature. However, we also decided to add some discussions and/or material about what this phrase does and does not mean. For example, in the online course that I redesigned, I incorporated possible implications when a result is or is not statistically significant.

Impact on Laura Ziegler’s teaching:

I attended USCOTS with three colleagues interested in statistics education. As a group, we decided that changes needed to be made with regard to how we teach hypothesis tests at our university. We have many flavors of introductory statistics in our Statistics department, nine to be exact! All instructors have their own methods of teaching, but we decided as a group that we wanted to be unified on how we approach significance. We held multiple meetings open to anyone (faculty or students) to discuss our plans. Participants ranged from people who love p-values to those who did not necessarily think that they needed to be taught in an introductory statistics course. In our first meeting, we participated in an Academic Controversy cooperative activity to start the conversation about p-values. Approximately 50 faculty and students, including statisticians and non-statisticians, attended.

In our next meetings, we all agreed there is a larger conversation to be had about statistical significance, but we decided on the following changes that could be easily implemented this semester in the short term.

  1. Put more effort into teaching practical significance.
  2. Avoid teaching hypothesis tests as a stand-alone statistical method. Emphasize that other analyses and discussions, such as effect sizes or confidence intervals, should accompany hypothesis tests.
  3. Use a grey scale for significance. We adapted a scale from the CATALST project with a minor change, adding grey!

I personally love these changes, and look forward to hearing more discussions and suggestions on the topic.

Impact on Adam’s teaching:

Since I started teaching introductory statistics as a graduate student, I have taught hypothesis testing via the interpretation of p-values using the sliding scale Laura Z. outlined above, while mentioning the dogmatic p < 0.05 and “ranting” against its use. So why teach the dogmatic interpretation? Well… I usually tell myself it’s because students will see that usage outside of my course and that I am making students statistically literate citizens… but upon further reflection, that’s simply a justification for writing problems that are easier to grade. Since USCOTS I made a resolution: I will still teach how to interpret a p-value as strength of evidence, but I will not lean on p-values to help students make overly simplistic statements about that evidence. Yes, some test questions will be harder to grade, but having students express what evidence actually is/means will be powerful. Further, I will continue to recommend the use of confidence intervals, where possible, as students can see borderline situations (e.g., does the interval (0.001, 0.067) support a meaningful difference?). Finally, I resolve to think about how to effectively discuss effect sizes in class. I admit that I am not familiar with how these are used across multiple disciplines, and I am leery of simplistic statements of what effect size is “big” or “interesting”, since these seem dangerously close to “significant”, but they do seem to be better tools. If you want to write a blog post on the topic, let us know!

Impact on Doug’s teaching:

I initially taught p-values with a narrow approach: alpha levels and dichotomous decisions. While I initially used a variety of alpha levels (more than just 0.05), there wasn’t much emphasis on when different alpha levels would be used – it was more procedurally focused. I then broadened my approach by considering different alpha levels for different contexts and emphasizing the relationship between alpha levels, Type I and II errors, and power. My next major teaching shift was to teach significance using the strength of evidence approach (as discussed by Laura Z. above). At my current institution, we emphasize the strength of evidence approach but also teach alpha levels for rejecting/failing to reject the null hypothesis. I present multiple ways to make a conclusion from p-values because it is plausible that students are going to encounter a variety of correct and incorrect uses of p-values after their introductory courses and preparing them as much as possible is key. 

I have also started asking students a follow-up question after they have interpreted the p-value such as “If you were the company in the problem, would you choose to [discuss the different options here]…” Again, this is no panacea, but connecting the evidence back to a (hypothetical) real-world decision seems to make the idea of strength of evidence easier for students to grasp.

After the USCOTS keynote and ensuing discussions, my colleagues and I discussed where we wanted to go with this. We currently have plans for iteratively improving our introductory statistics courses over the next few years, and making changes with regard to p-values is on our list. We don’t know how we will be teaching evidence in a few years, but for now we are planning on having more common assignments that emphasize a variety of different ways of interpreting results. It’s a manageable first step toward something more. 

Referenced USCOTS presentations: 

  1. Opening Session (Beth Chance, Danny Kaplan, Jessica Utts)
  2. Keynote by Ron Wasserstein and Allen Schirm
  3. Keynote by Kari Lock Morgan

Welcome to StatTLC!

Our editorial team welcomes you to the Statistics Teaching and Learning Corner (StatTLC), a virtual place to chat about statistics education. While there are many opportunities for educators to interact and disseminate research at conferences and in academic journals, there are fewer opportunities to informally discuss and share ideas and experiences. We have decided to launch this blog in an effort to share our own ideas and experiences teaching statistics and biostatistics at the college level, but also to provide a platform for the statistics education community to share their ideas and experiences.

You can expect to see relatively short, digestible posts about teaching and pedagogy resources for both face-to-face and online courses, research with a focus on how to implement the findings in the classroom, and teaching experiences from faculty instructors, researchers, and teaching assistants. Be on the lookout for question prompts and thought-provoking statements to inspire further discussion in the comments section of each post!