For this question, we will be looking at the Default
dataset sourced from the ISLR
package in R (but made available below without downloading the package).
This is a simulated dataset
containing information on ten thousand customers, with the goal of
predicting which customers will default on their credit card debt.
Default <- read.csv("https://collinn.github.io/data/default.csv")
Default$default_num <- ifelse(Default$default == "Yes", 1, 0)
Part A: Create a scatter plot of the Default data
in ggplot, with “balance” as the x-axis, “income” as the y-axis, and
using “default” (not “default_num”) status as the color aesthetic. Add
alpha = 0.25
to your geom_point()
to help with
the overplotting. Do you see any trends? Does default status tend to
associate more with income or credit card balance?
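As a point of reference, a minimal sketch of how such a plot might be built, assuming ggplot2 is loaded and using the Default data read in above:
library(ggplot2)
# Sketch: balance against income, colored by default status, with transparency
ggplot(Default, aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.25)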
Part B: Construct a logistic regression model with
the data using default_num
as the outcome, with
balance
and income
as the predictor variables.
Use exp()
and coef()
to extract the
coefficients and exponentiate them so that the values are in terms of
odds ratios.
Interpret the balance
coefficient. What does it mean for the balance
variable to increase by 1, and
what impact does this have on the probability of default? Does a value
of 0 for the covariates make sense in the Default
data? If so, how would you interpret the intercept here?
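For reference, a minimal sketch of how this model might be fit and its coefficients put on the odds ratio scale (the object name fit is just a placeholder):
# Sketch: logistic regression of default_num on balance and income
fit <- glm(default_num ~ balance + income, data = Default, family = binomial)
# Exponentiate the coefficients to express them as odds ratios
exp(coef(fit))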
Part C: Now we are going to modify our covariates
using the scale()
function, which modifies a numeric
variable by subtracting out its mean and dividing by its standard
deviation, the end result being a scaled variable with a mean
value of 0 and a standard deviation of 1.
Use mutate()
and scale()
to scale both balance
and income, naming these variables
balance.z
and income.z
(z is used to represent scaled variables). Reconstruct the model that was created in Part
B and interpret the new coefficient for balance.z.
What does it mean for balance.z
to change by 1?
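One possible way to do this, sketched with dplyr's mutate() (the variable names follow the prompt; fit_z is a placeholder):
library(dplyr)
# Sketch: scale the covariates, then refit the logistic regression
Default <- Default %>%
  mutate(balance.z = scale(balance),
         income.z = scale(income))
fit_z <- glm(default_num ~ balance.z + income.z, data = Default, family = binomial)
exp(coef(fit_z))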
Part D: Plot the ROC curve for the fitted model
using the plotROC
package. Because of a strange rounding
error in this function for this model, use the following variable as
the argument to both cutoffs.at
and cutoff.labels
in geom_roc():
cuts <- c(0.5,0.2, 0.1, 0.05, 0.01, 0.001)
## Use this
geom_roc(cutoffs.at = cuts, cutoff.labels = cuts)
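A sketch of how the ROC curve might then be plotted, assuming a fitted glm stored as fit (as in the earlier sketch) and the cuts vector above:
library(ggplot2)
library(plotROC)
# d = observed outcome, m = predicted probability of default
Default$pred <- predict(fit, type = "response")
ggplot(Default, aes(d = default_num, m = pred)) +
  geom_roc(cutoffs.at = cuts, cutoff.labels = cuts)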
Supposing that you were a bank who made loans to customers (and considering what priorities you may have as a business), which threshold on the ROC curve would you use to predict the default status of your customers? There is no one correct answer, but you should justify whichever value you choose.
For this question, we are going to investigate again the relationship between residuals from a fitted model and covariates that were not included. We will be using the simulated dataset below.
set.seed(123)
N <- 200
x1 <- runif(N, -2, 2) # Draw N uniform variables
x2 <- runif(N, -2, 2) # Draw another N variables
# Construct our outcome variable
y <- 2*x1 + 3*x2 + rnorm(N, sd = 2)
df <- data.frame(x1, x2, y)
Part A: Begin by creating a ggplot with
x1
on the x-axis and y
on the y-axis, adding
the layer
geom_smooth(method = lm, se = FALSE, color = 'red')
. Does
this appear to be a reasonable fit?
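A minimal sketch of such a plot, assuming ggplot2 is loaded and using the df created above:
library(ggplot2)
# Sketch: x1 against y with a least squares line overlaid
ggplot(df, aes(x = x1, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, color = 'red')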
Part B: Add to the plot you made in part A an
aesthetic for color mapped to the variable x2
. What do you
notice? What happens when the residuals are positive vs negative? Taking
into account large and small (positive and negative) values of the
residuals along with large and small values of x2
, describe
how a plot would look with x2
on the x-axis and the
residuals on the y-axis.
Part C: Use lm()
to create a model with
y ~ x1
, then plot x2
against the residuals.
How does this align with what you found in Part B?
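A sketch of one way to do this (fit1 and res are placeholder names):
library(ggplot2)
# Sketch: fit y on x1 alone, store the residuals, then plot them against x2
fit1 <- lm(y ~ x1, data = df)
df$res <- resid(fit1)
ggplot(df, aes(x = x2, y = res)) +
  geom_point()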
For this question, we will be re-examining the impact of sample size on variability by assessing the confidence intervals associated with parameter estimation. In doing so, we are also going to extend the usefulness of our R skills by not simply manipulating data, but by simulating it. While we are not introducing anything entirely new, we include here a few snippets of code and examples that may prove useful.
First, we can write functions that do not take any arguments, yet
still return results. In this example, our function uses
rnorm()
to draw two random normal numbers and add them
together. As this process is random, we get a different result each time
we run addSim()
.
# Takes no argument but does return a number
addSim <- function() {
x <- rnorm(1)
y <- rnorm(1)
x + y
}
addSim()
## [1] -1.2422
addSim()
## [1] -0.66359
We can use the replicate()
function which takes our
addSim()
as an argument, along with a value
n = 1000
, indicating that we wish to run this particular
function 1000 times. The results are returned as a vector, and we use
mean()
to find the average sum from our simulation.
results <- replicate(n = 1000, addSim())
mean(results)
## [1] 0.02356
Sometimes our simulation returns multiple numbers, in which case
replicate()
returns an object called an array. We can get
this into a form that we are used to by passing the
simplify = TRUE
argument to replicate()
.
When simplified, replicate()
returns a matrix with each column
representing one iteration of the function, making it very wide. We can
get it into long format using the transpose (flip rows and columns)
function, t():
summarySim <- function() {
x <- rnorm(100)
c(mean(x), var(x))
}
summarySim()
## [1] -0.031351 0.995095
results <- replicate(n = 1000, summarySim(), simplify = TRUE)
results <- t(results)
head(results)
## [,1] [,2]
## [1,] 0.123354 0.98863
## [2,] -0.157598 1.07245
## [3,] 0.020331 0.89356
## [4,] 0.029451 1.04335
## [5,] 0.090739 1.17278
## [6,] -0.031153 1.20700
Finally, there are times when we may wish to store our results in a data frame. One way to do this, using the tools we have discussed in class, is to create a data frame with the appropriate number of columns and use positional indexing to assign results:
# Create an NA data.frame
df <- data.frame(x = NA, y = NA)
for (i in 1:3) {
# Assign to ith row
df[i, ] <- c(i, i+10)
}
df
## x y
## 1 1 11
## 2 2 12
## 3 3 13
Part A: The construction of 95% confidence intervals is such that if we were to repeat our experiment with new data, the resulting confidence interval should contain the true value 95% of the time. We are going to verify this via simulation.
First, create a function that does the following:
- Creates two vectors of random, unrelated values, x and y
- Uses lm() to fit a linear model with y as the outcome. As these values are random and unrelated, the true value of \(\beta\) is equal to 0
- Uses the confint() (?confint) function to extract the confidence interval associated with \(\beta\) (this corresponds to parm = 2 in the confint arguments)
- Returns TRUE or FALSE, indicating whether or not the confidence interval contains the true value of 0
Then, using replicate()
, run this simulation 1000 times.
What proportion of these simulations contained the true value of 0 in
their confidence intervals?
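A possible skeleton for this simulation (ciSim is a placeholder name, and the choice of rnorm() and a sample size of 50 are illustrative assumptions):
# Sketch of one possible simulation function
ciSim <- function() {
  x <- rnorm(50)                # random, unrelated draws
  y <- rnorm(50)
  fit <- lm(y ~ x)
  ci <- confint(fit, parm = 2)  # 95% CI for the slope on x
  ci[1] < 0 & ci[2] > 0         # TRUE if the interval contains 0
}
results <- replicate(n = 1000, ciSim())
mean(results)                   # proportion of intervals containing 0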
Part B: Modify the simulation function you wrote in
Part A so that it now includes an argument N
specifying how
many random numbers to draw before fitting a model. Repeat the
simulation in Part A, this time for values of N = 10, 50, and 100. Does
the proportion of simulations containing the true value of 0 change as N
gets larger?
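One way this modification might look (ciSimN is a placeholder name):
# Sketch: the same simulation, now parameterized by sample size N
ciSimN <- function(N) {
  x <- rnorm(N)
  y <- rnorm(N)
  fit <- lm(y ~ x)
  ci <- confint(fit, parm = 2)
  ci[1] < 0 & ci[2] > 0
}
# Coverage proportion for each sample size
sapply(c(10, 50, 100), function(N) mean(replicate(1000, ciSimN(N))))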
Part C: For Part C, we will again use a simulation
function taking N
as an argument, except this time instead
of determining if the confidence interval contains zero, we will be
interested in returning the confidence interval from the simulation
itself (that is, our simulation will now return 2 numbers). We will be
running simulations of sample sizes with the following values of N:
N <- c(10, 20, 50, 100, 500, 1000)
For each N, we will do the following:
- Use replicate(n = 1000, runSim(N), simplify = TRUE) to run the simulation 1000 times
- From the output of replicate(), use colMeans() (or rowMeans()) to compute the mean upper and lower limits of the 95% confidence interval for that value of N

Create a data frame using the results from your simulation. Your data
frame should have one row for each value of N, with a column for
N as well as columns for the mean values of the upper and
lower confidence limits. Finally, create a scatter/line plot in ggplot
with factor(N)
(this will help with spacing) on the x-axis,
the value of the confidence interval on the y-axis, and assign a color
aesthetic according to whether it is the upper or lower limit.
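One possible way to assemble and plot these results, assuming a function runSim(N) as described above that returns the lower and upper limits of a single 95% confidence interval (res, cis, lower, and upper are placeholder names):
library(ggplot2)
# Uses the vector N defined above
res <- data.frame(N = NA, lower = NA, upper = NA)
for (i in seq_along(N)) {
  cis <- t(replicate(n = 1000, runSim(N[i]), simplify = TRUE))
  res[i, ] <- c(N[i], colMeans(cis))  # mean lower and upper limits
}
ggplot(res, aes(x = factor(N))) +
  geom_point(aes(y = lower, color = "lower")) +
  geom_point(aes(y = upper, color = "upper"))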
Comment on the resulting plot. What do you notice? Does the marginal value of adding more observations remain the same as our sample size increases?