The theme of this year’s United States Conference on Teaching Statistics (USCOTS 2019), “Evaluating Evidence,” emphasized the current debate over p-values and the use of the word “significant” in statistical conclusions. Conference-wide presentations (1, 2, 3) offered updates to the official ASA statement on p-values, based on the special issue of The American Statistician, and suggested potential ways to move beyond significance.
Now that USCOTS is four months behind us, we thought it would be a good idea to reflect on how it has impacted each of our teaching plans for this year. Each member of our editing team has shared their thoughts below. What are yours? [Share your plans in the comments section.]
If you are interested in starting a discussion about “statistical significance” in your own classroom, check out this cooperative learning activity that Laura Ziegler used in her department.
Impact on Steve’s teaching:
I left USCOTS feeling cautiously optimistic about the p-value/significance debate. On one hand, I was starting to feel like the discussion was spinning its wheels, focusing only on defining the problem and not on coming up with solutions. On the other hand, I learned about new ideas from numerous presenters who focused not only on alternative ways of evaluating evidence, but also on how to teach these new methods in the classroom. Despite my mixed feelings, the front-running solution in my mind is covered in The American Statistician editorial: results-blind science publishing (Locascio, 2017; 2019). Not only does results-blind publishing mean that studies will be evaluated on the importance of their research questions and the appropriateness of their study designs, but it will simultaneously remove limitations inherent to p-values and other similar measures that result in intentional (or unintentional) misuses of statistics. I think journals that implement this review strategy will be taking a big step in the right direction.
In the classroom this semester, I want to actively reduce the emphasis on p-values and statistical significance to make sure my students are grasping the bigger picture in statistical problem-solving. I think instructors of statistics tend to overemphasize the results of models, which causes students to make quick, straightforward conclusions using p-values. In an attempt to remedy this, I will be making a more conscious effort to prioritize the overarching statistical process during class and on homework feedback.
Impact on Laura Le’s teaching:
After USCOTS, I brought the conversation back to my teaching team in Biostatistics. A few courses were being revised, so it was a perfect time to discuss what to do about the phrase “statistical significance.” We decided to continue using the phrase in the introductory biostatistics courses, because it is one our students will frequently encounter in the medical and public health literature. In addition, we decided to add some discussions and/or material about what this phrase does and does not mean. For example, in the online course that I redesigned, I incorporated possible implications when a result is or is not statistically significant.
Impact on Laura Ziegler’s teaching:
I attended USCOTS with three colleagues interested in statistics education. As a group, we decided that changes needed to be made with regard to how we teach hypothesis tests at our university. We have many flavors of introductory statistics in our Statistics department, nine to be exact! All instructors have their own methods of teaching, but we decided as a group that we wanted to be unified on how we approach significance. We held multiple meetings open to anyone (faculty or students) to discuss our plans. Participants ranged from people who love p-values to those who did not necessarily think that they needed to be taught in an introductory statistics course. In our first meeting, we participated in an Academic Controversy cooperative activity to start the conversation about p-values. Approximately 50 faculty and students, including statisticians and non-statisticians, attended.
In our next meetings, we all agreed there is a larger conversation to be had about statistical significance, but we decided on the following short-term changes that could be easily implemented this semester.
- Put more effort into teaching practical significance.
- Avoid teaching hypothesis tests as a stand-alone statistical method. Emphasize that other analyses and discussions, such as effect sizes or confidence intervals, should accompany hypothesis tests.
- Use a grey scale for significance. We adapted a scale from the CATALST project with a minor change, adding grey!
I personally love these changes, and look forward to hearing more discussions and suggestions on the topic.
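For readers who want to try the changes listed above in a lab or homework setting, here is a minimal sketch of what reporting a test alongside an effect size and a confidence interval might look like. Everything in it is an assumption made for illustration: the data are made up, the p-value and interval use a normal approximation (a t-based interval would be more appropriate for samples this small), and the cutoffs in `evidence_label` are hypothetical placeholders, not the scale adapted from the CATALST project.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize_difference(group_a, group_b, confidence=0.95):
    """Report a two-sided p-value together with an effect size and an interval."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2

    # Standard error of the difference in means (variances not pooled).
    se = sqrt(var_a / n_a + var_b / n_b)

    # Normal approximation (an assumption; a t distribution suits small samples better).
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Cohen's d: the difference in means divided by the pooled standard deviation.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    cohens_d = diff / pooled_sd

    return {"difference": diff, "ci": ci, "cohens_d": cohens_d, "p_value": p_value}


def evidence_label(p_value):
    """Map a p-value to a graded label. The cutoffs are hypothetical placeholders."""
    if p_value < 0.01:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.10:
        return "weak"
    return "little or no"


# Made-up measurements for two groups, used only to exercise the functions.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]
group_b = [4.7, 4.8, 5.0, 4.6, 5.1, 4.9]

result = summarize_difference(group_a, group_b)
print(f"difference in means: {result['difference']:.2f}")
print(f"95% interval: ({result['ci'][0]:.2f}, {result['ci'][1]:.2f})")
print(f"Cohen's d: {result['cohens_d']:.2f}")
print(f"evidence against the null: {evidence_label(result['p_value'])}")
```

The point of returning all four quantities together is that students never see the p-value on its own: the size of the difference and its uncertainty sit right next to it.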
Impact on Adam’s teaching:
Since I started teaching introductory statistics as a graduate student, I have taught hypothesis testing via the interpretation of p-values using the sliding scale Laura Z. outlined above, while mentioning the dogmatic p < 0.05 and “ranting” against its use. So why teach the dogmatic interpretation? Well… I usually tell myself it’s because students will see that usage outside of my course and that I am making students statistically literate citizens… but upon further reflection, that’s simply a justification for writing problems that are easier to grade. Since USCOTS I have made a resolution: I will still teach how to interpret a p-value as strength of evidence, but I will not lean on p-values to help students make overly simplistic statements about that evidence. Yes, some test questions will be harder to grade, but having students express what evidence actually is/means will be powerful. Further, I will continue to recommend the use of confidence intervals, where possible, as students can see borderline situations: for example, does the interval (0.001, 0.067) support a meaningful difference? Finally, I resolve to think about how to effectively discuss effect sizes in class. I admit that I am not familiar with how these are used across multiple disciplines, and I am leery of simplistic statements of what effect size is “big” or “interesting”, since these seem dangerously close to “significant”, but they do seem to be better tools. If you want to write a blog post on the topic, let us know!
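To make the borderline interval question concrete, here is a small sketch that compares a confidence interval for a difference with a smallest difference that would matter in context. The function name, the category wording, and the threshold of 0.05 are all hypothetical; in a real problem the threshold has to come from subject-matter knowledge, which is exactly the discussion worth having with students.

```python
def describe_interval(lower, upper, smallest_meaningful):
    """Compare an interval for a difference with a context-specific threshold.

    `smallest_meaningful` is a hypothetical, positive, subject-matter threshold:
    the smallest difference (in either direction) that would matter in practice.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if smallest_meaningful <= 0:
        raise ValueError("smallest_meaningful must be positive")

    # The interval contains zero, so "no difference" is a plausible value.
    if lower <= 0 <= upper:
        return "includes zero"

    # Distances from zero of the nearest and farthest ends of the interval.
    nearest = min(abs(lower), abs(upper))
    farthest = max(abs(lower), abs(upper))

    if nearest >= smallest_meaningful:
        return "meaningful"   # every plausible value is at least the threshold
    if farthest < smallest_meaningful:
        return "negligible"   # zero is excluded, yet every value is below the threshold
    return "borderline"       # spans both negligible and meaningful values


# The interval from the text, with an assumed threshold of 0.05.
print(describe_interval(0.001, 0.067, smallest_meaningful=0.05))
```

Under that assumed threshold the interval is labeled borderline: it excludes zero, but it contains both values too small to matter and values large enough to matter, so a one-word verdict would hide the interesting part.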
Impact on Doug’s teaching:
I initially taught p-values with a narrow approach: alpha levels and dichotomous decisions. Although I used a variety of alpha levels (more than just 0.05), there wasn’t much emphasis on when different alpha levels would be used; it was more procedurally focused. I then broadened my approach by considering different alpha levels for different contexts and emphasizing the relationship between alpha levels, Type I and II errors, and power. My next major teaching shift was to teach significance using the strength of evidence approach (as discussed by Laura Z. above). At my current institution, we emphasize the strength of evidence approach but also teach alpha levels for rejecting/failing to reject the null hypothesis. I present multiple ways to draw a conclusion from p-values because it is plausible that students will encounter a variety of correct and incorrect uses of p-values after their introductory courses, and preparing them as much as possible is key.
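The relationship between alpha levels, Type I and II errors, and power can be shown to students with a short calculation. The sketch below assumes the simplest textbook setting, a one-sided z-test for a mean with known standard deviation; the effect size, standard deviation, and sample size are made-up values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a one-sided z-test (H0: mean = 0 vs. Ha: mean > 0).

    effect: assumed true mean under the alternative (hypothetical value)
    sigma:  population standard deviation, treated as known
    n:      sample size
    alpha:  Type I error rate chosen by the analyst
    """
    standard_normal = NormalDist()
    # Reject H0 when the z statistic exceeds this critical value.
    z_crit = standard_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centered at effect * sqrt(n) / sigma.
    shift = effect * sqrt(n) / sigma
    return 1 - standard_normal.cdf(z_crit - shift)


# Made-up scenario: true mean 0.5, sigma 1, n = 25.
for alpha in (0.01, 0.05, 0.10):
    power = power_one_sided_z(effect=0.5, sigma=1.0, n=25, alpha=alpha)
    type_ii = 1 - power  # probability of failing to reject a false null
    print(f"alpha = {alpha:.2f}  power = {power:.3f}  Type II error = {type_ii:.3f}")
```

Running the loop shows the trade-off directly: a stricter alpha lowers the Type I error rate but raises the Type II error rate, and when the true effect is zero the "power" is simply alpha.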
I have also started asking students a follow-up question after they have interpreted the p-value such as “If you were the company in the problem, would you choose to [discuss the different options here]…” Again, this is no panacea, but connecting the evidence back to a (hypothetical) real-world decision seems to make the idea of strength of evidence easier for students to grasp.
After the USCOTS keynote and ensuing discussions, my colleagues and I discussed where we wanted to go with this. We currently have plans for iteratively improving our introductory statistics courses over the next few years, and making changes with regard to p-values is on our list. We don’t know how we will be teaching evidence in a few years, but for now we are planning on having more common assignments that emphasize a variety of different ways of interpreting results. It’s a manageable first step toward something more.
Referenced USCOTS presentations: