A Collaborative, Project-Based Data Science Course for Computer Science and Statistics Students
Part 3 of 3. Read Part 1: Dealing with Humanities Data and Part 2: Digital Humanities and Mary Shelley’s The Last Man
Data science is a newly developing field that merges ideas from statistics and computer science to turn raw data into new understanding and knowledge about the world. In this course, statistics will inform the discussion of which goals are appropriate for learning from the data and how the data will answer the questions raised. The computer science perspective will help us figure out which goals are computationally feasible and how to achieve them.
Note: Content adapted from original curricular project
This course is an introduction to data science and centers on three topics, one for each week: data manipulation, data visualization, and big data. I developed and co-taught this course with Ann Cannon (Professor of Mathematics and Statistics).
The prerequisite is either one course in statistics or one course in computer science. It is meant to be accessible to sophomores, but most of our students were juniors and seniors. The course is centered around one collaborative group project. It was taught on Cornell’s block plan calendar (one course at a time for 3.5 weeks), meeting each day for two hours in the morning and two hours in the afternoon.
Ann focused on statistical questions related to appropriate goals for learning from a data set and techniques for performing the analysis. I taught the programming concepts needed to scrape, clean, and visualize data effectively. Our afternoons were spent in the computer lab applying what we had learned in the morning through a set of “real world” projects.
Students will gain the ability to
- Communicate statistical ideas clearly and accurately
- Understand the importance of and techniques for collecting data
- Appropriately clean data
- Display and interpret data
- Understand simple linear models
- Follow the statistical process of starting with a question, collecting the requisite data, analyzing the data, and reporting on the results
- Understand fundamental programming concepts that will allow them to adapt easily to new computational environments
- Design and specify solutions to problems in a systematic way using logic and mathematical reasoning
- Develop communication skills related to working successfully in a group and presenting their research results to others
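To make the "simple linear models" outcome concrete, here is a minimal sketch of fitting a least-squares line. Python is used purely for illustration and is not necessarily the language used in the course; the function name `fit_line` and the age and finish-time values are made up for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustration data: runner age (years) and finish time (minutes).
ages = [25, 30, 35, 40, 45]
times = [80.0, 82.0, 84.0, 86.0, 88.0]

intercept, slope = fit_line(ages, times)
print(f"time = {intercept:.1f} + {slope:.2f} * age")
```

The slope is read as the change in predicted finish time for each additional year of age. Whether a straight line is an appropriate model for a given data set is exactly the kind of question the statistics half of the course addresses.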
The course was heavily project-based. Students worked on one term-long project. Each student proposed a project idea during the first week of the course. The best ideas were chosen, and students were placed into groups of 3-4 to work on these projects for the remainder of the term. Each group gave three weekly updates, in both written and oral form: a group proposal, initial findings, and a final report.
Given the short time frame, the work that the students did on their group projects was simply outstanding. Topics ranged from “analyzing the internet traffic at Cornell” to “mining images of Patrick Swayze” to “clustering Twitter data related to #ferguson”. As part of their final project, each student group created a web page that contains their final report, presentation materials, and source code.
During the first two weeks of the course, we spent our lab time on two guided projects that built students’ familiarity with tools for manipulating and analyzing data, which they could then apply to their own projects. The Race Results Project was drawn from Chapter 2 of Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (see Resources below for the full citation) and introduced students to web scraping, data cleaning, and data analysis. The Twitter Trends Project was drawn from the SIGCSE Nifty Assignments (see Resources below for the full citation) and introduced students to gathering data via an application programming interface, geographic visualizations, and sentiment analysis.
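The two sketches below give a rough sense of the kind of work involved in these guided projects. They are illustrations under stated assumptions, not the projects' actual code: Python is used only for illustration, the helper names `time_to_minutes` and `tweet_sentiment` are hypothetical, the input strings are invented, and the four-word lexicon stands in for a much larger one. The first function cleans a scraped finish-time field; the second scores a tweet by averaging the scores of the words it contains.

```python
import re


def time_to_minutes(raw):
    """Convert 'h:mm:ss' or 'mm:ss' text to minutes.

    Stray characters such as spaces or footnote marks ('#', '*') are dropped.
    Returns None when the field holds no usable time.
    """
    cleaned = re.sub(r"[^0-9:]", "", raw)
    parts = cleaned.split(":")
    if len(parts) not in (2, 3) or any(p == "" for p in parts):
        return None
    numbers = [int(p) for p in parts]
    if len(numbers) == 2:
        numbers = [0] + numbers  # no hours field present
    hours, minutes, seconds = numbers
    return hours * 60 + minutes + seconds / 60


# Hypothetical word scores in [-1, 1]; a real lexicon would be far larger.
WORD_SCORES = {"love": 1.0, "great": 0.75, "slow": -0.25, "hate": -1.0}


def tweet_sentiment(text, scores=WORD_SCORES):
    """Average the scores of the words found in the lexicon.

    Returns None when no word in the text has a score.
    """
    words = re.findall(r"[a-z']+", text.lower())
    found = [scores[w] for w in words if w in scores]
    return sum(found) / len(found) if found else None


print(time_to_minutes("1:02:30#"))
print(time_to_minutes("DNF"))
print(tweet_sentiment("I love this great race!"))
```

Returning `None` for unusable fields, rather than zero, keeps missing values from silently distorting later averages and plots.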
For our first time teaching this course, we were very pleased with the results, and the students appeared to agree. The following are a few excerpts from our course evaluations:
- “I loved the open-ended project nature of the course. Allowing us to just take an idea and run with it meant that we learned a ton, even if it wasn’t about things that were directly related to the course. I loved that.”
- “Each instructor brings great perspective to the topic of data and data analysis, and the segmented lectures were a very effective tool in demonstrating the depth and breadth of the topic. The course did a good job lending itself to both CS and Statistics. One of my favorite courses I’ve ever taken!”
- “This course was very hands on learning. Everything we learned about computer science and statistics we got to put into actual use during the course. These two disciplines go together very well. The groups also worked very well with people having strengths in computer science and stats.”
- “I also loved getting to learn how to recreate graphs we see in the news - it was so cool!”
- “Interesting and timely topic, given that large datasets are being more and more commonly used for various purposes. Splitting the lectures between Ann’s ‘This is what can be done with this data’ and Ross’s ‘This is how you do it’ segments was a good move.”
- “This course looks at very interesting problems in real world data collection and research, an opportunity we probably wouldn’t ever have in another class at Cornell. The instructors were, in addition to being very smart about their specialities, very good at understanding and explaining the problems of people who had little to no experience with computer science or statistics.”
As with any new course, this one was far from perfect. Below are some excerpts from the evaluations that sum this up:
- “It was very clear that this was the first time the course was put together and while the ideas were very good they are not yet fine tuned. Both professors knew what they were looking for in assignments but sometimes did not convey that to us very well. Since it is a new topic to us we did not know exactly what they wanted and were graded on this much harsher than expected.”
- “This class is very difficult to teach in 4 weeks. Also, the fact that it was taught by two professors made the first week of the class a little strange. On the other side, the remaining three weeks of the class have been exponentially better. Both teachers showed that they were able to adapt with the class relatively quick.”
- “Excellent course, but please rethink the workload on the first and second weeks, especially since we’re learning a whole new programming language (and its many peculiarities) in addition to the projects.”
- “I also think the feedback could be much more constructive since it sometimes contradicted itself and it was apparent that it came from two separate sources. Instead of this way of giving evaluations I think using one class time period and having face to face meetings would be far more beneficial.”
We both want to build on this experience, and we are eager to co-teach the class again. Doing so will give us the opportunity to iron out some of the kinks and to become more familiar with each other’s portion of the course, so that we might be able to teach it independently in the future. We plan to revise or replace the Race Results Project, which did not go entirely smoothly.
Resources & Materials
Deborah Nolan and Duncan Temple Lang. 2015. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Chapman & Hall/CRC.
Nick Parlante, Julie Zelenski, Michelle Craig, John DeNero, Mark Guzdial, David J. Malan, Aditi Muralidharan, Eric Roberts, and Kevin Wayne. 2013. Nifty assignments. In Proceeding of the 44th ACM technical symposium on Computer science education (SIGCSE ’13). ACM, New York, NY, USA, 539-540. DOI: http://dx.doi.org/10.1145/2445196.2445356