For this question, we will be looking at the Default
dataset sourced from the ISLR
package in R (but made available below without downloading the package).
This is a simulated dataset
containing information on ten thousand customers, with the goal of
predicting which customers will default on their credit card debt.
Default <- read.csv("https://collinn.github.io/data/default.csv")
Default$default_num <- ifelse(Default$default == "Yes", 1, 0)
Part A: Create a scatter plot of the Default data
in ggplot, with “balance” as the x-axis, “income” as the y-axis, and
using “default” (not “default_num”) status as the color aesthetic. Add
alpha = 0.25
to your geom_point()
to help with
the overplotting. Do you see any trends? Does default status tend to
associate more with income or credit card balance?
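As a point of reference, a minimal sketch of how such a plot might be built, assuming ggplot2 is loaded and using the Default data read in above:
library(ggplot2)
# Sketch: balance against income, colored by default status, with transparency
ggplot(Default, aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.25)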
Part B: Construct a logistic regression model with
the data using default_num
as the outcome, with
balance
and income
as the predictor variables.
Use exp()
and coef()
to extract the
coefficients and exponentiate them so that the values are in terms of
odds ratios.
Interpret the balance
coefficient. What does it mean for the balance
variable to increase by 1, and
what impact does this have on the probability of default? Does a value
of 0 for the covariates make sense in the Default
data? If so, how would you interpret the intercept here?
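For reference, a minimal sketch of how this model might be fit and its coefficients put on the odds ratio scale (the object name fit is just a placeholder):
# Sketch: logistic regression of default_num on balance and income
fit <- glm(default_num ~ balance + income, data = Default, family = binomial)
# Exponentiate the coefficients to express them as odds ratios
exp(coef(fit))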
Part C: Now we are going to modify our covariates
using the scale()
function, which modifies a numeric
variable by subtracting out its mean and dividing by its standard
deviation, the end result being a scaled variable with a mean
value of 0 and a standard deviation of 1.
Use mutate()
and scale()
to scale both balance
and income, naming these variables
balance.z
and income.z
(z is used to represent scaled variables). Reconstruct the model that was created in Part
B and interpret the new coefficient for balance.z.
What does it mean for balance.z
to change by 1?
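One possible way to do this, sketched with dplyr's mutate() (the variable names follow the prompt; fit_z is a placeholder):
library(dplyr)
# Sketch: scale the covariates, then refit the logistic regression
Default <- Default %>%
  mutate(balance.z = scale(balance),
         income.z = scale(income))
fit_z <- glm(default_num ~ balance.z + income.z, data = Default, family = binomial)
exp(coef(fit_z))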
Part D: Plot the ROC curve for the fitted model
using the plotROC
package. Because of a strange rounding
error in this function for this model, use the following variable as
the argument to both cutoffs.at
and cutoff.labels
in geom_roc():
cuts <- c(0.5,0.2, 0.1, 0.05, 0.01, 0.001)
## Use this
geom_roc(cutoffs.at = cuts, cutoff.labels = cuts)
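A sketch of how the ROC curve might then be plotted, assuming a fitted glm stored as fit (as in the earlier sketch) and the cuts vector above:
library(ggplot2)
library(plotROC)
# d = observed outcome, m = predicted probability of default
Default$pred <- predict(fit, type = "response")
ggplot(Default, aes(d = default_num, m = pred)) +
  geom_roc(cutoffs.at = cuts, cutoff.labels = cuts)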
Supposing that you were a bank who made loans to customers (and considering what priorities you may have as a business), which threshold on the ROC curve would you use to predict the default status of your customers? There is no one correct answer, but you should justify whichever value you choose.
For this question, we are going to investigate again the relationship between residuals from a fitted model and covariates that were not included. We will be using the simulated dataset below.
set.seed(123)
N <- 200
x1 <- runif(N, -2, 2) # Draw N uniform variables
x2 <- runif(N, -2, 2) # Draw another N variables
# Construct our outcome variable
y <- 2*x1 + 3*x2 + rnorm(N, sd = 2)
df <- data.frame(x1, x2, y)
Part A: Begin by creating a ggplot with
x1
on the x-axis and y
on the y-axis, adding
the layer
geom_smooth(method = lm, se = FALSE, color = 'red')
. Does
this appear to be a reasonable fit?
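A minimal sketch of such a plot, assuming ggplot2 is loaded and using the df created above:
library(ggplot2)
# Sketch: x1 against y with a least squares line overlaid
ggplot(df, aes(x = x1, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, color = 'red')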
Part B: Add to the plot you made in part A an
aesthetic for color mapped to the variable x2
. What do you
notice? What happens when the residuals are positive vs negative? Taking
into account large and small (positive and negative) values of the
residuals along with large and small values of x2
, describe
how a plot would look with x2
on the x-axis and the
residuals on the y-axis.
Part C: Use lm()
to create a model with
y ~ x1
, then plot x2
against the residuals.
How does this align with what you found in Part B?
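A sketch of one way to do this (fit1 and res are placeholder names):
library(ggplot2)
# Sketch: fit y on x1 alone, store the residuals, then plot them against x2
fit1 <- lm(y ~ x1, data = df)
df$res <- resid(fit1)
ggplot(df, aes(x = x2, y = res)) +
  geom_point()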
For this question, we will be re-examining the impact of sample size on variability by assessing the confidence intervals associated with parameter estimation. In doing so, we are also going to extend the usefulness of our R skills by not simply manipulating data, but by simulating it. While we are not introducing anything entirely new, we include here a few snippets of code and examples that may prove useful.
First, we can write functions that do not take any arguments, yet
still return results. In this example, our function uses
rnorm()
to draw two random normal numbers and add them
together. As this process is random, we get a different result each time
we run addSim()
.
# Takes no argument but does return a number
addSim <- function() {
x <- rnorm(1)
y <- rnorm(1)
x + y
}
addSim()
## [1] -1.2422
addSim()
## [1] -0.66359
We can use the replicate()
function which takes our
addSim()
as an argument, along with a value
n = 1000
, indicating that we wish to run this particular
function 1000 times. The results are returned as a vector, and we use
mean()
to find the average sum from our simulation.
results <- replicate(n = 1000, addSim())
mean(results)
## [1] 0.02356
Sometimes our simulation returns multiple numbers, in which case
replicate()
returns an object called an array. We can get
this into a form that we are used to by passing the
simplify = TRUE
argument to replicate()
.
When simplified, replicate()
returns a matrix with each column
representing one iteration of the function, making it very wide. We can
get it into long format using the transpose (flip rows and columns)
function, t():
summarySim <- function() {
x <- rnorm(100)
c(mean(x), var(x))
}
summarySim()
## [1] -0.031351 0.995095
results <- replicate(n = 1000, summarySim(), simplify = TRUE)
results <- t(results)
head(results)
## [,1] [,2]
## [1,] 0.123354 0.98863
## [2,] -0.157598 1.07245
## [3,] 0.020331 0.89356
## [4,] 0.029451 1.04335
## [5,] 0.090739 1.17278
## [6,] -0.031153 1.20700
Finally, there are times when we may wish to store our results in a data frame. One way to do this, using the tools we have discussed in class, is to create a data frame with the appropriate number of columns and use positional indexing to assign results:
# Create an NA data.frame
df <- data.frame(x = NA, y = NA)
for (i in 1:3) {
# Assign to ith row
df[i, ] <- c(i, i+10)
}
df
## x y
## 1 1 11
## 2 2 12
## 3 3 13
Part A: The construction of 95% confidence intervals is such that if we were to repeat our experiment with new data, the resulting confidence interval should contain the true value 95% of the time. We are going to verify this via simulation.
First, create a function that does the following:
- Creates two vectors of random, unrelated values, x and y
- Uses lm() to fit a linear model with y as the outcome. As these values are random and unrelated, the true value of \(\beta\) is equal to 0
- Uses the confint() (?confint) function to extract the confidence interval associated with \(\beta\) (this corresponds to parm = 2 in the confint arguments)
- Returns TRUE or FALSE, indicating whether or not the confidence interval contains the true value of 0
Then, using replicate()
, run this simulation 1000 times.
What proportion of these simulations contained the true value of 0 in
their confidence intervals?
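A possible skeleton for this simulation (ciSim is a placeholder name, and the choice of rnorm() and a sample size of 50 are illustrative assumptions):
# Sketch of one possible simulation function
ciSim <- function() {
  x <- rnorm(50)                # random, unrelated draws
  y <- rnorm(50)
  fit <- lm(y ~ x)
  ci <- confint(fit, parm = 2)  # 95% CI for the slope on x
  ci[1] < 0 & ci[2] > 0         # TRUE if the interval contains 0
}
results <- replicate(n = 1000, ciSim())
mean(results)                   # proportion of intervals containing 0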
Part B: Modify the simulation function you wrote in
Part A so that it now includes an argument N
specifying how
many random numbers to draw before fitting a model. Repeat the
simulation in Part A, this time for values of N = 10, 50, and 100. Does
the proportion of simulations containing the true value of 0 change as N
gets larger?
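One way this modification might look (ciSimN is a placeholder name):
# Sketch: the same simulation, now parameterized by sample size N
ciSimN <- function(N) {
  x <- rnorm(N)
  y <- rnorm(N)
  fit <- lm(y ~ x)
  ci <- confint(fit, parm = 2)
  ci[1] < 0 & ci[2] > 0
}
# Coverage proportion for each sample size
sapply(c(10, 50, 100), function(N) mean(replicate(1000, ciSimN(N))))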
Part C: For Part C, we will again use a simulation
function taking N
as an argument, except this time instead
of determining if the confidence interval contains zero, we will be
interested in returning the confidence interval from the simulation
itself (that is, our simulation will now return 2 numbers). We will be
running simulations of sample sizes with the following values of N:
N <- c(10, 20, 50, 100, 500, 1000)
For each N, we will do the following:
- Use replicate(n = 1000, runSim(N), simplify = TRUE) to run the simulation 1000 times
- From the output of replicate(), use colMeans() (or rowMeans()) to compute the mean upper and lower limits of the 95% confidence interval for that value of N

Create a data frame using the results from your simulation. Your data
frame should have one row for each value of N, with a column for
N as well as columns for the mean values of the upper and
lower confidence limits. Finally, create a scatter/line plot in ggplot
with factor(N)
(this will help with spacing) on the x-axis,
the value of the confidence interval on the y-axis, and assign a color
aesthetic according to whether it is the upper or lower limit.
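One possible way to assemble and plot these results, assuming a function runSim(N) as described above that returns the lower and upper limits of a single 95% confidence interval (res, cis, lower, and upper are placeholder names):
library(ggplot2)
# Uses the vector N defined above
res <- data.frame(N = NA, lower = NA, upper = NA)
for (i in seq_along(N)) {
  cis <- t(replicate(n = 1000, runSim(N[i]), simplify = TRUE))
  res[i, ] <- c(N[i], colMeans(cis))  # mean lower and upper limits
}
ggplot(res, aes(x = factor(N))) +
  geom_point(aes(y = lower, color = "lower")) +
  geom_point(aes(y = upper, color = "upper"))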
Comment on the resulting plot. What do you notice? Does the marginal value of adding more observations remain the same as our sample size increases?