STA-290 HW 2

Question 1

This question will involve the penguins dataset.

pengy <- read.csv("https://collinn.github.io/data/penguins.csv")

Part A How many observations are included in the penguin dataset? What does each observation represent?

Part B How many of each species of penguin are included in the dataset? Which species has the greatest number of observations?

Part C What type of plot would be most appropriate to summarize the flipper length, measured in millimeters, of the penguins in the dataset? Produce this plot and comment on what you observe.

Part D Observing multiple potential centers in a distribution can often suggest the presence of multiple groups. How many centers do there appear to be? Create different plots with faceting to see if you can determine which “groups” might be present in the distribution of flipper length.

Part E Reproduce the following plot as closely as you can. (The palette for this color scheme is “Set1”)
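Parts D and E rely on two plotting ideas: faceting (one panel per group) and a named color palette. The sketch below illustrates both on R's built-in iris data rather than the penguins, so it is a pattern to adapt and not a solution. It assumes the ggplot2 package is installed and is the plotting library used in the course; the choice of variables and the number of bins are arbitrary.

```r
library(ggplot2)  # assumption: ggplot2 is the plotting package used in the course

# iris ships with R and stands in for the penguins data here
p <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(bins = 20, color = "white") +  # bins = 20 is an arbitrary choice
  facet_wrap(~ Species, ncol = 1) +             # one panel per group
  scale_fill_brewer(palette = "Set1")           # ColorBrewer "Set1" palette

p  # print the plot
```

Stacking the panels in a single column keeps the x-axis aligned, which makes it easier to compare where each group is centered.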

Question 2

This question is Question 4.2 from the textbook, reproduced here. The dataset below contains the results of a poll based on a random sample and has two variables: response, each respondent's answer to the poll question, and political, each respondent's self-reported political ideology.

Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.

## Copy and run this code to create table
immigration <- read.csv("https://collinn.github.io/data/immigrationpoll.csv")

Use the appropriate tables to answer the following questions:

Part A What percent of these Tampa, FL voters identify themselves as conservatives?

Part B What percent of these Tampa, FL voters are in favor of the citizenship option?

Part C What percent of these Tampa, FL voters identify themselves as conservatives and are in favor of the citizenship option?

Part D What percent of these Tampa, FL voters who identify themselves as conservatives are in favor of the citizenship option? What percent of moderates share this view? What percent of liberals share this view?

Part E Do political ideology and views on immigration appear to be associated? Explain your reasoning.
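Parts A through D ask for three different kinds of percentages: marginal (one variable alone), joint (both variables at once), and conditional (one variable within a level of the other). A minimal sketch of how each can be computed with table() and prop.table() follows. The data frame toy and its values are hypothetical and are not the poll results.

```r
# Hypothetical toy data (NOT the poll results): two categorical variables
toy <- data.frame(
  group  = c("a", "a", "a", "b", "b", "b", "b", "b"),
  answer = c("yes", "yes", "no", "yes", "no", "no", "no", "no")
)

# Two-way table of counts: rows are groups, columns are answers
tab <- table(toy$group, toy$answer)

# Marginal: share of all observations falling in each group
marginal <- prop.table(table(toy$group))

# Joint: share of all observations falling in each (group, answer) cell
joint <- prop.table(tab)

# Conditional: share of each answer within a group (each row sums to 1)
cond_by_group <- prop.table(tab, margin = 1)

marginal
joint
cond_by_group
```

Using margin = 2 instead would condition on the columns, so each column would sum to 1. Which margin is appropriate depends on which group the question restricts attention to.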

Question 3

This question uses the tips dataset. Recall that the variables are:

total_bill Total amount spent on the meal
tip Total tip
sex Sex of the individual paying the bill
smoker Indicator of whether or not the bill payer was a smoker
day Day of the week of the meal
time Time of day
size Number of people in the party

tips <- read.csv("https://collinn.github.io/data/tips.csv")

Part A Using the tips dataset, determine which day was the least popular for diners to visit. Which day was the most popular for having lunch?

Part B Given that a particular meal was “Dinner”, what is the probability that the person paying the bill was a male?

Part C Give the odds ratio comparing the odds of a female being a smoker to the odds of a male being a smoker (Hint: which value is considered the “event”?). Do sex and smoking status appear to be associated?
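As a reminder of the mechanics, an odds ratio divides the odds of the event in one group by the odds of the event in another. The sketch below uses a hypothetical helper odds_ratio() and made-up counts, not the tips data. It assumes a 2x2 table with groups in the rows and the event in the first column.

```r
# Hypothetical helper: odds ratio from a 2x2 table of counts.
# Assumes rows are groups and the FIRST column is the "event".
odds_ratio <- function(tab) {
  odds <- tab[, 1] / tab[, 2]  # odds of the event within each row
  unname(odds[1] / odds[2])    # odds in row 1 relative to row 2
}

# Made-up counts for illustration only
toy <- matrix(c(30, 70,
                20, 80),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("g1", "g2"),
                              outcome = c("event", "no_event")))

odds_ratio(toy)  # (30/70) / (20/80)
```

An odds ratio near 1 suggests little association. Swapping the rows, or swapping which column counts as the event, inverts the ratio, so it is worth stating both choices explicitly.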

Part D Create the table with margin sums that best captures the information in the following plot. Be mindful of which should be the row and which should be the column.

Part E Create a plot from the following table. Does there appear to be much association between smoking status and time of day?

Smoker	Time
	Dinner	Lunch
No	0.602	0.662
Yes	0.398	0.338
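One possible way to work from a printed table of proportions is to type it in as a matrix and pass it to a plotting function. The sketch below uses base R's barplot() to stay dependency-free; the object name smoker_by_time is an arbitrary choice, and the course may expect a different plotting approach.

```r
# Enter the printed proportions by hand: rows are Smoker, columns are Time
smoker_by_time <- matrix(c(0.602, 0.662,
                           0.398, 0.338),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(Smoker = c("No", "Yes"),
                                         Time   = c("Dinner", "Lunch")))

# Each column sums to 1, so these are proportions of smokers within each time
colSums(smoker_by_time)

# Stacked bar chart: one bar per time, segments for each smoker status
barplot(smoker_by_time,
        legend.text = TRUE,
        xlab = "Time",
        ylab = "Proportion")
```

Because the columns sum to 1, each bar has the same height, and any association shows up as a difference in where the segments split.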

Question 4

The table and plots below present the results of a social survey relating individuals’ income levels with their reported sense of happiness.

Income Level	Happiness			Total
	Not Too Happy	Pretty Happy	Very Happy	Sum
Above Average	21	159	110	290
Average	53	372	221	646
Below Average	94	249	93	436
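For anyone who wants to check calculations in R, the counts above can be entered by hand. This is a sketch, with happy as an arbitrary object name; addmargins() is used only to confirm that the row totals match the Total column.

```r
# Counts copied from the table above: rows are income level, columns are happiness
happy <- matrix(c(21, 159, 110,
                  53, 372, 221,
                  94, 249,  93),
                nrow = 3, byrow = TRUE,
                dimnames = list(
                  Income    = c("Above Average", "Average", "Below Average"),
                  Happiness = c("Not Too Happy", "Pretty Happy", "Very Happy")))
happy <- as.table(happy)

# Append row totals as a "Sum" column to compare against the printed table
addmargins(happy, margin = 2)
```

From here, prop.table() with margin = 1 or margin = 2 gives row or column proportions, depending on which group a question conditions on.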

Part A Of all of the individuals who have a below average income, what proportion report being “Pretty Happy”?

Part B Would Plot 1 or Plot 2 be more useful in answering the question: “Which income group has the highest proportion of individuals who consider themselves ‘Not Too Happy’?” Justify your answer.

Part C We are interested in answering the question, “Is having an above average income associated with being ‘Very Happy’?”

In pursuit of this, your friend notes: of all of the individuals who report being “Very Happy”, about 74% of them also report having either an average or below average income. In other words, most people who are “Very Happy” are not those who make an above average income. Based on this, we can conclude that having an above average income is not associated with somebody being “Very Happy”. Is this reasoning correct? If yes, explain why it is correct; if not, explain what mistake is being made and how you would answer differently.

Question 5

The following question involves the UCBAdmissions dataset, which collects aggregate data on applications to graduate school at UC Berkeley for the six largest departments in 1973. Variables in this dataset include the department (A-F), the sex of the applicant, and the associated admissions decision. This dataset became famous for its use as evidence relating to claims of sex bias in admission practices. For this problem, we are going to investigate the veracity of these claims.

Start by copying the code block below, which converts the dataset built into R from a table of counts into the one-row-per-observation format we are used to working with. Then, answer the following questions.

## Create UC Berkeley data from built-in R dataset
data("UCBAdmissions")
ucb <- as.data.frame(UCBAdmissions)
ucb <- ucb[rep(1:nrow(ucb), times = ucb$Freq), ]
ucb$Freq <- NULL
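To see what the expansion step does, the sketch below repeats it with intermediate objects. The names counts and expanded are illustrative and not part of the assignment. as.data.frame() yields one row per combination of Admit, Gender, and Dept with a Freq column, and rep() then repeats each row index Freq times.

```r
data("UCBAdmissions")

# One row per Admit x Gender x Dept cell (2 * 2 * 6 = 24 rows), with counts in Freq
counts <- as.data.frame(UCBAdmissions)

# Repeat each row index as many times as its count, giving one row per applicant
expanded <- counts[rep(seq_len(nrow(counts)), times = counts$Freq), ]

nrow(counts)        # number of cells in the original table
nrow(expanded)      # number of applicants
sum(counts$Freq)    # should match nrow(expanded)
levels(counts$Dept) # department labels
```

After the expansion the Freq column is redundant, which is why the assignment code removes it.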

Part A Create the appropriate table to show how many applicants there were of each gender and whether or not they were admitted. State what you find in terms of the relationship between gender and admission status, based on an odds ratio.

Part B Create a frequency table showing how many applicants of each sex applied to each department. Which two departments had the greatest number of male applicants? Which had the greatest number of female applicants?

Part C Create a table of proportions showing the rate at which each department accepted and rejected candidates. Which two departments had the highest acceptance rate? Which had the lowest?

Part D Create a bar chart showing for each sex what proportion of them were admitted or rejected. What do you see from this chart?

Part E To the plot you made in Part D, now add faceting to also show the proportion of each sex accepted by department. Do the admission rates seem similar in each department, or are there any disparities? Based on what you have seen, what statements can you make about the existence of sex-based discrimination in admissions at Berkeley?