Last updated: Fri Sep 1 2023, 11:07
Instructor:
Class Meetings:
Office Hours:
Class Mentors
STA-230-01 Yuki Huang (huangyuk@grinnell.edu)
STA-230-02 Anne Bader (baderann@grinnell.edu)
Course Description:
This course introduces core topics in data science using R programming. This includes introductions to getting and cleaning data, data management, exploratory data analysis, reproducible research, and data visualization. This course incorporates case studies from multiple disciplines and emphasizes the importance of properly communicating statistical ideas. Prerequisite: MAT-209 or STA-209. Suggested: CSC-151 or computer programming experience.
Texts:
There is no required textbook for this course. All necessary materials will be posted on the course website.
There will be recommended readings from several published textbooks, all of which are freely available online and do not need to be purchased:
This course aims to develop in students informed critical and theoretical perspectives on data collection, data manipulation and production, and the use of algorithmic techniques to process and analyze data.
After completing this course, students should be able to:
Use the R programming environment to manage, process,
and format unstructured data, to generate meaningful data
visualizations, and to fit and evaluate supervised or unsupervised
learning models.
Generally, I try not to check my email after 5pm on weekdays and usually not on weekends either. This isn’t to say that it won’t ever happen, only that it should not be anticipated. Rest assured, there are very few situations in STA-230 in which an unread email would lead to disastrous consequences. I will do my best to avoid assigning things due outside of these hours, but in the event that a situation like that does arise, you will not be punished for my failure to respond.
Class Sessions
The core component of our class meetings will be working through hands-on labs in a paired programming environment. These pairs will be assigned during the first half of the semester. After the first project, you will have the freedom to choose your partner, or work independently near someone that you can occasionally consult with. During labs it is essential that you and your partner(s) work together, making certain that each of you understands your work equally well.
Most labs will begin with a brief “preamble” section that we will go through together as a class. The purpose of this section is to introduce the topic of the lab and ensure a smooth start to each class meeting.
Attendance
Because this course involves substantial group work, absences impact not only you but also your classmates. That said, I understand that missing class is sometimes necessary. If you will be absent for any reason, I ask to be notified as soon as possible. Showing up late or missing class more than twice without prior notice will negatively impact the participation component of your course grade.
Late Work
Assignments are generally due at 11:59pm on their assigned due date. All assignments will have an automatic 48-hour grace period during which they can still be submitted with a 5% penalty applied. After 48 hours, P-web assignment windows will close, but late work may still be submitted via email with a 20% penalty, unless solutions have been posted/shared or grades/feedback have been returned. Special exceptions to these policies must be arranged in advance of an assignment’s deadline, or coordinated with Grinnell College academic support staff.
Software
Software is an essential component of data science and will play an
important role in this course. We will primarily use R, an
open-source statistical software program. You will also be expected to
write, document, and submit code used for projects and assignments
throughout the semester.
You are welcome to use your own personal laptop, or a Grinnell
College laptop, during the course. R is freely available,
and you can download it and its UI companion, R Studio,
here (note: R must be downloaded and installed before
R Studio):
R from http://www.r-project.org/
R Studio from http://www.rstudio.com/
You may also work on a classroom computer, all of which will have
R and R Studio pre-installed.
Academic Honesty
At Grinnell College you are part of a conversation among scholars, professors, and students, one that helps sustain both the intellectual community here and the larger world of thinkers, researchers, and writers. The tests you take, the research you do, the writing you submit: all these are ways you participate in this conversation.
The College presumes that your work for any course is your own contribution to that scholarly conversation, and it expects you to take responsibility for that contribution. That is, you should strive to present ideas and data fairly and accurately, indicate what is your own work, and acknowledge what you have derived from others. This care permits other members of the community to trace the evolution of ideas and check claims for accuracy.
Failure to live up to this expectation constitutes academic dishonesty. Academic dishonesty is misrepresenting someone else’s intellectual effort as your own. Within the context of a course, it also can include misrepresenting your own work as produced for that class when in fact it was produced for some other purpose. A complete list of dishonest behaviors, as defined by Grinnell College, can be found here.
Inclusive Classroom
Grinnell College makes reasonable accommodations for students with documented disabilities. To receive accommodations, students must provide documentation to the Coordinator for Disability Resources; information can be found here. If you plan on using accommodations in this course, you should speak with me as early as possible in the semester so that we can discuss ways to ensure your full participation in the course.
Religious Holidays
Grinnell College encourages students who plan to observe holy days that coincide with class meetings or assignment due dates to consult with their instructor in the first three weeks of classes so that they may reach a mutual understanding of how they can meet the terms of their religious observance and the requirements of the course.
Engagement and Participation - 5%
Participation in a lab-heavy course is absolutely critical. During labs you are expected to help your partner(s) learn the material (which goes beyond simply answering the lab questions), and they are expected to help you. Everyone will begin the semester with a baseline participation score of 80%, which will then move up or down depending on my subjective assessment of your behavior during class. You can very quickly raise this score to 100% by doing a superb job helping your lab partner(s), and working diligently to understand course material during class. Alternatively, you can lower this score by skipping class, letting your lab partner(s) do most of the work, using your phone or surfing the web during class, etc. If you are ever unsure of your participation standing, you can email me and I will be happy to provide you with an interim assessment.
Labs - 20%
All labs contain embedded questions that you and your lab partner will answer together. Each group member should submit their own copy of the group’s answers via P-web prior to the assignment’s due date. Oftentimes multiple labs will be due at the same time, and you are welcome to upload your answers as a single file. Some lab questions will be scored for accuracy with feedback given, while others will be scored for effort/completion. If it becomes clear that you and your partner are using a “divide and conquer” approach to answering lab questions, your score on that assignment will be penalized. Sketch solutions will typically be posted 3-5 days after lab write-ups are due. Any lab that is turned in after sketch solutions are posted will receive no more than 50% credit.
Individual Homework - 20%
There will be 5-7 individual homework assignments throughout the semester. These assignments are naturally cumulative, and intended to force you to combine concepts and methods from multiple labs. You are welcome to get help from DASIL mentors or me on these assignments, but your submissions should be your own work, and you should add the name(s) of anyone you received help from as a footnote on the question that they helped you with. Most homework assignments will also include a short reading and a reflection.
Data Cleaning/Visualization Take-home - 10%
This assessment is intended to measure your ability to clean and manipulate complex data to arrive at a specific endpoint (recreating a data visualization). The task should take approximately 4 hours to complete, but you will have 48 hours to work on the assessment once it is assigned.
You can find details on the take-home at this link, which will become active later in the semester.
Midterm Project - 20%
For this project, you will develop an R Shiny app that
visually explores a data set of your choosing. You may work on this
project individually, or in a group of two. You will
deliver a 5-minute in-class presentation demonstrating the capabilities
of your app and discussing some of the trends in your data that your app
displays. You will be evaluated on both the features of your app and how
well you communicate what you see in the data.
You can find details on the midterm project at this link, which will become active later in the semester.
Final Project - 25%
This project is a start-to-finish data science application on a non-trivial data set. The final product is a three-page written report accompanied by R code and documentation. You may work on this project individually, or in a group of two or three.
You can find details on the final project at this link, which will become active later in the semester.
Why use R for Data Science?
If you’ve spent time reading about data science online you’ll
undoubtedly have noticed the emphasis placed on the Python programming
language. Indeed, research from Cal State University found Python was
the most popular data science language in private industry, being
mentioned in 42% of data scientist job postings. However, R,
which was mentioned in 20% of job postings, is not far
behind and offers a few advantages when approaching data science from a
statistical perspective (hence this course having the STA prefix).
Both R and Python provide plenty of functions for data
manipulation. However, because R was created by academic statisticians,
it offers very strong data visualization and statistical modeling
packages. On the other hand, Python is a general-purpose programming
language that excels in production, deployment, and machine learning.
Regardless of each language’s strengths and weaknesses, as an
introductory course our focus is on the fundamental skills and thought
processes used in data science – which is something that can be
accomplished regardless of the tools used (which will change over time
anyways).
Getting Help
In addition to visiting office hours and completing the recommended readings, there are many other ways in which you can find help on assignments and projects.
The Data Science and Social Inquiry Lab (DASIL) is staffed by mentors
who are experienced in R programming and may be able to
troubleshoot coding problems you are having. Many students who’ve
successfully completed this course have made extensive use of the DASIL
work space and its computing resources.
The online platform Stack Overflow
is a useful resource to find user-generated coding solutions to common
R problems. Nearly all professional data scientists have
needed to “look up” a coding strategy on a site like Stack Overflow at
some point in their career, and I have no problem with you doing the
same on assignments or projects. However, if you make substantial use of
a Stack Overflow answer (i.e., actually integrating lines of code written
by someone else into your work, not just getting help identifying the
right functions/arguments) the expectation is that you cite or
acknowledge doing so.
R (data structures, functions, packages, help documentation, and R Markdown)
ggplot2 and data visualization principles
tidyr, wrangling using dplyr, merging and joining, string processing and regular expressions
plotly, maps with leaflet, dashboards with R Shiny
This syllabus and course are largely borrowed from Professor Ryan Miller, who graciously permitted me to use them.