Most of us statistics (and data science!) educators understand that knowing how to use statistical software is integral to our majors' success, both in their coursework and in their careers. However, in many degree programs, software usage is seen as a means to an end – producing an analysis – rather than an end goal in its own right. How did this come about, why does it matter, and what can we do to change our software-related instruction? These are the questions I discuss below, first by looking at some history of programming in these contexts and then by presenting two current philosophies on how to incorporate programming.
Getting a bachelor’s degree in mathematics has long meant learning computer programming. As statistics degree offerings appeared, they adopted this convention, and the emerging field of data science, with its inherent computational needs, has followed suit. Whether Java, C, R (or S-PLUS back in my day!), Python, or SAS – students pursuing a degree in statistics or data science routinely program in at least one of these languages. Unfortunately, this requirement is often met simply by adding a course to an existing degree program. In some cases, the course is merely borrowed from another department and does not meet the students’ discipline-specific needs! Even in relatively new degree programs, our field’s approach to programming seems anachronistic.
We commonly use software in the classroom to help teach a variety of topics – exploring data graphically, computing classical summary or inferential statistics, or conducting a simulation to study the properties of a resampling technique – and this inclusion of software is often touted as evidence that we have modernized our curricula. However, if students do not build the programming skills necessary to implement and understand these analyses, then the software becomes a black box.
Why Does This Matter?
While there are several reasons to revisit how we teach programming, the one I focus on here is that programming is different from most other skills we teach – if a program is inefficient, doesn’t follow good programming practices, or is otherwise sub-optimal, it can still produce correct results! Programming is not just something students should do to get an answer. We have an obligation to go beyond teaching students how to write functional code – we must train high-quality programmers. Statistics and data science careers that make extensive use of programming are exceedingly popular in their own right, and as data sets grow larger and programming becomes a ubiquitous skill, there is immense value in students being able not only to write code that solves a problem, but to follow best practices while doing so.
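To make this concrete, here is a minimal, hypothetical Python sketch (my own illustration, not from any course described here): two functions that both compute a sample mean correctly, even though only one follows good programming practices. Correctness alone cannot tell them apart – which is exactly why assessing only the answer is not enough.

```python
# Both functions compute the sample mean correctly, but only one
# follows good programming practices. A grader checking answers
# alone would score them identically.

def mean_poor(x):
    # Works, but: cryptic names, manual index bookkeeping,
    # no docstring, and it crashes unhelpfully on an empty list.
    s = 0
    i = 0
    while i < len(x):
        s = s + x[i]
        i = i + 1
    return s / len(x)

def mean_good(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean of an empty sequence is undefined")
    return sum(values) / len(values)

data = [2, 4, 6, 8]
print(mean_poor(data) == mean_good(data))  # both give 5.0
```

The point is not that `mean_poor` is wrong – it is that nothing in its output reveals its problems, so best practices must be taught and assessed explicitly.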
Degree programs have typically adopted one of two prevailing philosophies regarding programming instruction: integrated or standalone. Both approaches have advantages and disadvantages worth weighing when designing an educational experience that prepares our students to write the end-to-end programs they will use in their careers. In this context, I define end-to-end programming as the application of the following three components.
1. Data cleaning and preparation
2. Data summary, analysis, and modeling
3. Reporting/presentation of results
Of course, most programs make use of general computing concepts (e.g., file types, paths, etc.), and not every program needs to employ all three components. However, students trained to write this style of end-to-end program can easily adapt to writing programs that require only one or two of the components.
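As a minimal illustration – hypothetical, and in Python rather than any particular course language – the three components might appear in a short end-to-end script like this, with the raw data inlined so the sketch is self-contained:

```python
import csv
import io
from statistics import mean

# Component 1: data cleaning/preparation -- read a (hypothetical) raw
# file and coerce messy fields to usable types.
raw = io.StringIO("subject,score\nA,81\nB, 93\nC,NA\nD,88\n")
scores = []
for row in csv.DictReader(raw):
    value = row["score"].strip()
    if value != "NA":          # drop the missing-value code
        scores.append(float(value))

# Component 2: data summary/analysis -- a simple descriptive statistic.
avg = mean(scores)

# Component 3: reporting/presentation -- a formatted, human-readable result.
print(f"Mean score across {len(scores)} subjects: {avg:.1f}")
```

Even this toy script exercises all three components plus general computing concepts (file formats, missing-value codes, numeric types) that integrated instruction often leaves implicit.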
Integrated instruction typically focuses only on data summary, analysis, and modeling – concepts used as a means to an end for discipline-specific course content. For example, a regression course might teach SAS modeling tools such as PROC REG while excluding any programming concepts not explicitly needed to complete the course. The most obvious pedagogical benefit of this approach is that instructors can present a programming skill after students are familiar with the underlying statistical concept. A second, related benefit is logistical – students do not need to worry about when to take a programming course because programming is learned in concert with the discipline-specific content.
However, the drawbacks of relying solely on integrated instruction are substantial. The overarching issue is that students are less likely to gain an appreciation for, or even an understanding of, the general computing principles necessary to be a practicing statistician. A primary example: the classroom data sets to which students are exposed have typically already been sanitized, so students lose opportunities to develop skills in reading, cleaning, and restructuring data. Students are also better able to understand the requirements of good data collection methods when they have seen the results of poorly collected or poorly maintained data sets. Additionally, teaching programming alongside discipline-specific content requires more instructional time.
Integrated instruction also requires every instructor to teach at least some computing concepts in addition to the course-specific content. Depending on department size, this can be an unrealistic expectation if not all faculty are well-versed in the same language, and repeated exposure to a single language is important for learners. Exposing students to multiple languages is valuable, of course, but if done without proper structure, students cannot build on what they learned in an earlier course and instructors cannot assume prior knowledge.
I define standalone instruction to mean classes covering the language-specific concepts and any general programming concepts required to use the language effectively. In a SAS course, for example, this would mean covering not only SAS concepts but also path/directory structures, file types and attributes, image resolution, etc. There are two common “flavors” of the standalone course: applications-focused and whole-language. The applications-focused flavor – where material on data summary/analysis/modeling (Component 2) provides students with the skills necessary to carry out the discipline-specific analyses needed in their other courses – is similar to the integrated approach above, except these analysis tools are gathered into a single course and some time may be devoted to data cleaning/preparation and reporting/presentation of results (Components 1 and 3, respectively). The whole-language approach provides much less in the way of Component 2 skills and instead focuses on Components 1 and 3, teaching the analysis software from a computer science perspective – syntax, compilation, and good programming practices – while including a few basic Component 2 concepts so students get practice writing end-to-end programs.
The applications-focused course suffers from significant logistical issues – when should students take it, and what should it include? If taken too early, students are unlikely to understand most of the analysis techniques; if taken too late, they cannot apply the programming skills in their discipline-specific courses, severely limiting the course’s utility. To determine the content, instructors must agree on what skills will be useful throughout the degree program, and deviation from that list in later courses also reduces the course’s utility. Additionally, concentrating the analysis material in a single course still deprives students of a deeper understanding of the software’s capabilities and operation, and is less likely to instill good programming practices.
The whole-language approach should still include simple analysis techniques, which is both a benefit and a drawback. It removes the logistical barriers because students can take the course early in their degree program, but it then derives its usefulness from how programming is emphasized in the remainder of a student’s coursework. If future courses never or rarely require students to use the skills gained in this early-career course, its benefits are severely blunted. When used properly, however, the whole-language approach provides a solid foundation onto which students can add skills presented in later classes while lowering the instructor’s burden in those courses.
What Should We Do?
To best educate our students, we should apply both approaches in a way that minimizes their drawbacks and maximizes their benefits, ensuring we are truly training programmers rather than just teaching students to write a program as a means to an end. Because students need to be prepared to write high-quality end-to-end programs, we should show them early in their careers what that process looks like. To meet these goals, I propose the following as a starting point for degree programs looking to modernize their approach to teaching programming skills.
1. Employ an early-career, standalone course using whole-language instruction. Use it to introduce the three components and establish good programming practices.
2. Use integrated instruction in the same language in multiple future courses, each time assessing the students’ programming skills.
3. Enforce a common set of good programming practices across all courses.
4. Apply a common rubric for assessing programming skills across all courses.
Of course, most of us are not in a position to rewrite our department’s curricula or convince our colleagues to teach their courses differently. However, we can all take steps – such as collaborating with whoever teaches our program’s programming course (or proposing a new course!) and choosing to assess programming in our own classes – to help our students develop these crucial skills. By building a strong foundation, vertically integrating a programming language into our curriculum, and enforcing good programming practices, we can not only produce high-quality data scientists and statisticians but also move beyond just teaching programming and start training programmers for the careers that are waiting for them.
Contributing author Jonathan Duggins is a Teaching Assistant Professor in the Department of Statistics at North Carolina State University.