Syllabus - STA-230 (Fall 2023)

Last updated: Fri Sep 1 2023, 11:07

Course Information

Instructor:

  • Collin Nolte, Noyce 2216, noltecollin@grinnell.edu

Class Meetings:

  • STA-230-01 Noyce 2401, MWF 10:00-10:50AM
  • STA-230-02 Noyce 2401, MWF 1:00-1:50PM

Office Hours:

  • Noyce 2216, Monday 2-3, Tuesday 10:30-11:30, Thursday 1-2

Class Mentors

Course Description:

This course introduces core topics in data science using R programming. This includes introductions to getting and cleaning data, data management, exploratory data analysis, reproducible research, and data visualization. This course incorporates case studies from multiple disciplines and emphasizes the importance of properly communicating statistical ideas. Prerequisite: MAT-209 or STA-209. Suggested: CSC-151 or computer programming experience.

Texts:

There is no required textbook for this course. All necessary materials will be posted on the course website.

There will be recommended readings from several published textbooks, all of which are freely available online and do not need to be purchased:

  1. Modern Data Science with R by Baumer, Kaplan, and Horton
  2. R for Data Science by Wickham and Grolemund
  3. An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani

Aims and Objectives

This course aims to develop in students informed critical and theoretical perspectives on data collection, data manipulation and production, and the use of algorithmic techniques to process and analyze data.

Learning Objectives

After completing this course, students should be able to:

  • Apply methods of data exploration, visualization, and modeling to illustrate key findings and make justifiable inferences
  • Translate research questions into quantifiable metrics or models that can be evaluated
  • Use the R programming environment to manage, process, and format unstructured data, to generate meaningful data visualizations, and to fit and evaluate supervised or unsupervised learning models
  • Clearly and concisely communicate the findings of data-driven analyses to statistical and non-statistical audiences

Policies

Email

Generally, I try not to check my emails after 5pm on weekdays and usually not on weekends either. This isn’t to say that it won’t ever happen, only that it should not be anticipated. Rest assured, there are very few situations in STA-230 in which an unread email would lead to disastrous consequences. I will do my best to avoid assigning things due outside of these hours, but in the event that a situation like that does arise, you will not be punished for my failure to respond.

Class Sessions

The core component of our class meetings will be working through hands-on labs in a pair programming environment. These pairs will be assigned during the first half of the semester. After the first project, you will have the freedom to choose your partner, or work independently near someone that you can occasionally consult with. During labs it is essential that you and your partner(s) work together, making certain that each of you understands your work equally well.

Most labs will begin with a brief “preamble” section that we will go through together as a class. The purpose of this section is to introduce the topic of the lab and ensure a smooth start to each class meeting.

Attendance

Because this course involves substantial group work, absences impact not only yourself, but also your classmates. That said, I understand that missing class is sometimes necessary. If you will be absent for any reason, I ask to be notified as soon as possible. Showing up late or missing class more than twice without prior notice will negatively impact the participation component of your course grade.

Late Work

Assignments are generally due at 11:59pm on their assigned due date. All assignments will have an automatic 48-hour grace period during which they can still be submitted with a 5% penalty applied. After 48 hours, P-web assignment windows will close, but late work may still be submitted via email with a 20% penalty, unless solutions have been posted/shared or grades/feedback have been returned. Special exceptions to these policies must be arranged in advance of an assignment’s deadline, or coordinated with Grinnell College academic support staff.

Software

Software is an essential component of data science and will play an important role in this course. We will primarily use R, an open-source statistical software program. You will also be expected to write, document, and submit code used for projects and assignments throughout the semester.

You are welcome to use your own personal laptop, or a Grinnell College laptop, during the course. R is freely available and you can download it and its UI companion, RStudio, here (note: R must be downloaded and installed before RStudio):

  1. Download R from http://www.r-project.org/
  2. Download RStudio from http://www.rstudio.com/

You may also work on a classroom computer, all of which will have R and RStudio pre-installed.

Academic Honesty

At Grinnell College you are part of a conversation among scholars, professors, and students, one that helps sustain both the intellectual community here and the larger world of thinkers, researchers, and writers. The tests you take, the research you do, the writing you submit: all these are ways you participate in this conversation.

The College presumes that your work for any course is your own contribution to that scholarly conversation, and it expects you to take responsibility for that contribution. That is, you should strive to present ideas and data fairly and accurately, indicate what is your own work, and acknowledge what you have derived from others. This care permits other members of the community to trace the evolution of ideas and check claims for accuracy.

Failure to live up to this expectation constitutes academic dishonesty. Academic dishonesty is misrepresenting someone else’s intellectual effort as your own. Within the context of a course, it also can include misrepresenting your own work as produced for that class when in fact it was produced for some other purpose. A complete list of dishonest behaviors, as defined by Grinnell College, can be found here.

Inclusive Classroom

Grinnell College makes reasonable accommodations for students with documented disabilities. To receive accommodations, students must provide documentation to the Coordinator for Disability Resources; more information can be found here. If you plan on using accommodations in this course, you should speak with me as early as possible in the semester so that we can discuss ways to ensure your full participation in the course.

Religious Holidays

Grinnell College encourages students who plan to observe holy days that coincide with class meetings or assignment due dates to consult with their instructor in the first three weeks of classes so that they may reach a mutual understanding of how they can meet both the terms of their religious observance and the requirements of the course.

Grading

Engagement and Participation - 5%

Participation in a lab-heavy course is absolutely critical. During labs you are expected to help your partner(s) learn the material (which goes beyond simply answering the lab questions), and they are expected to help you. Everyone will begin the semester with a baseline participation score of 80%, which will then move up or down depending on my subjective assessment of your behavior during class. You can very quickly raise this score to 100% by doing a superb job helping your lab partner(s), and working diligently to understand course material during class. Alternatively, you can lower this score by skipping class, letting your lab partner(s) do most of the work, using your phone or surfing the web during class, etc. If you are ever unsure of your participation standing, you can email me and I am happy to provide you an interim assessment.

Labs - 20%

All labs contain embedded questions that you and your lab partner will answer together. Each group member should submit their own copy of the group’s answers via P-web prior to the assignment’s due date. Oftentimes multiple labs will be due at the same time, and you are welcome to upload your answers as a single file. Some lab questions will be scored for accuracy with feedback given, while others will be scored for effort/completion. If it becomes clear that you and your partner are using a “divide and conquer” approach to answering lab questions, your score on that assignment will be penalized. Sketch solutions will typically be posted 3-5 days after lab write-ups are due. Any lab that is turned in after sketch solutions are posted will receive no more than 50% credit.

Individual Homework - 20%

There will be 5-7 individual homework assignments throughout the semester. These assignments are naturally cumulative, and intended to force you to combine concepts and methods from multiple labs. You are welcome to get help from DASIL mentors or me on these assignments, but your submissions should be your own work, and you should add the name(s) of anyone you received help from as a footnote on the question that they helped you with. Most homework assignments will also include a short reading and a reflection.

Data Cleaning/Visualization Take-home - 10%

This assessment is intended to measure your ability to clean and manipulate complex data to arrive at a specific endpoint (recreating a data visualization). The task should take approximately 4 hours to complete, but you will have 48 hours to work on the assessment once it is assigned.

You can find details on the take-home at this link, which will become active later in the semester.

Midterm Project - 20%

For this project, you will develop an R Shiny app that visually explores a data set of your choosing. You may work on this project individually, or in a group of two. You will deliver a 5-minute in-class presentation demonstrating the capabilities of your app and discussing some of the trends in your data that your app displays. You will be evaluated on both the features of your app and how well you communicate what you see in the data.

You can find details on the midterm project at this link, which will become active later in the semester.

Final Project - 25%

This project is a start-to-finish data science application on a non-trivial data set. The final product is a three-page written report accompanied by R code and documentation. You may work on this project individually, or in a group of two or three.

You can find details on the final project at this link, which will become active later in the semester.

Misc

Why use R for Data Science?

If you’ve spent time reading about data science online you’ll undoubtedly have noticed the emphasis placed on the Python programming language. Indeed, research from Cal State University found Python was the most popular data science language in private industry, being mentioned in 42% of data scientist job postings. However, R, which was mentioned in 20% of job postings, is not far behind and offers a few advantages when approaching data science from a statistical perspective (hence this course having the STA prefix).

Both R and Python provide plenty of functions for data manipulation. However, because R was created by academic statisticians, it offers very strong data visualization and statistical modeling packages. On the other hand, Python is a general-purpose programming language that excels in production, deployment, and machine learning. Regardless of each language’s strengths and weaknesses, as an introductory course our focus is on the fundamental skills and thought processes used in data science – which is something that can be accomplished regardless of the tools used (which will change over time anyways).

Getting Help

In addition to visiting office hours and completing the recommended readings, there are many other ways in which you can find help on assignments and projects.

The Data Analysis and Social Inquiry Lab (DASIL) is staffed by mentors who are experienced in R programming and may be able to troubleshoot coding problems you are having. Many students who’ve successfully completed this course have made extensive use of the DASIL work space and its computing resources.

The online platform Stack Overflow is a useful resource to find user-generated coding solutions to common R problems. Nearly all professional data scientists have needed to “look up” a coding strategy on a site like Stack Overflow at some point in their career, and I have no problem with you doing the same on assignments or projects. However, if you make substantial use of a Stack Overflow answer (i.e., actually integrating lines of code written by someone else into your work, not just getting help identifying the right functions/arguments), the expectation is that you cite or acknowledge doing so.

Topic List

  1. Introduction to R (data structures, functions, packages, help documentation, and R Markdown)
  2. Creating graphics (ggplot2 and data visualization principles)
  3. Data manipulation (reshaping using tidyr, wrangling using dplyr, merging and joining, string processing and regular expressions)
  4. Interactive data visualizations (plotly, maps with leaflet, dashboards with R Shiny)
  5. Introduction to unsupervised learning (principal component analysis, dimension reduction, and clustering)
  6. Introduction to supervised learning (cross-validation, performance metrics, regression, classification, and machine learning methods)

Acknowledgement

This syllabus and course are largely borrowed from Professor Ryan Miller, who graciously permitted me to use them.