# install.packages("ISLR2")
# install.packages("randomForest")
# install.packages("caret")
# install.packages("rpart")
# install.packages("rpart.plot")

library(rpart)
library(rpart.plot)
library(randomForest)
library(MASS)
library(ISLR2)
library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## Helper: an evenly spaced grid of l points spanning the range of x
makeseq <- function(x, l = 20) {
  seq(min(x), max(x), length.out = l)
}

Individual Trees for Regression

First, an illustration of how a regression tree partitions the predictor space, in contrast to standard regression

data(Hitters)
hit_tree <- rpart(Salary ~ Years + Hits, data = Hitters)

rpart.plot(hit_tree) # Without pruning

However, we can control the depth of the tree with pruning, governed by the complexity parameter cp; plotcp() shows the cross-validated error across values of cp

plotcp(hit_tree)

rpart.plot(prune(hit_tree, cp = 0.071))
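Rather than reading a value off the plot, a common heuristic is to take the cp that minimizes the cross-validated error stored in the fitted tree's cptable. This is a sketch of that heuristic, not part of the original analysis:

## Illustrative: prune at the cp with the smallest cross-validated error (xerror)
best_cp <- hit_tree$cptable[which.min(hit_tree$cptable[, "xerror"]), "CP"]
rpart.plot(prune(hit_tree, cp = best_cp))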


Individual Trees for Classification

crab <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", 
                   header = TRUE)

crab$y <- as.factor(crab$y)

ff <- rpart(y ~ width + color, data = crab)
rpart.plot(ff)

plotcp(ff)

rpart.plot(prune(ff, cp = 0.11))

rpart.plot(prune(ff, cp = 0.072))

rpart.plot(prune(ff, cp = 0.056))

rpart.plot(prune(ff, cp = 0.028))

rpart.plot(prune(ff, cp = 0.013))

Tree Building

We need a metric for assessing where the splits should be made. For each of the metrics below, let \(\hat{p}_{mk}\) denote the proportion of observations in node \(m\) that are members of class \(k\)

  1. Classification Error Rate: This is defined as the proportion of observations in node \(m\) that do not belong to the most common class

\[ E = 1 - \max_k (\hat{p}_{mk}) \] However, this is not sensitive to differences in node composition that can matter. For example, consider two nodes:

\[ \mathcal{N}_1 = \{A, A, A, A, A, B, B, B, B\}, \\ \mathcal{N}_2 = \{A, A, A, A, A, B, C, C, D\} \] Both have the same error rate, \(E = 4/9\), since 5/9 of the observations in each belong to class A; yet Node 1 is superior in that it more cleanly partitions our classification space.

  2. Gini Purity Index: The Gini Purity Index, or Gini index, is a measure of total variance across the classes in node \(m\)

\[ G = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \] This value is minimized when all of the \(\hat{p}_{mk}\) are close to 0 or 1, that is, when the node is nearly pure. In practice, this is the metric most often used.
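As a quick numeric check of the two example nodes above (this snippet is mine, not part of the original notes), both metrics can be computed from a vector of class labels; the error rate is identical, while the Gini index correctly rates Node 1 as purer:

## Compute both impurity metrics for a node from its class labels
node_metrics <- function(labels) {
  p <- table(labels) / length(labels)
  c(error = 1 - max(p), gini = sum(p * (1 - p)))
}
node_metrics(c("A", "A", "A", "A", "A", "B", "B", "B", "B"))  # error 0.44, gini 0.49
node_metrics(c("A", "A", "A", "A", "A", "B", "C", "C", "D"))  # error 0.44, gini 0.62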

Sensitivity

One problem with a single tree is that it can be highly sensitive to the particulars of the data. Here, we create two crab datasets that are simply bootstrapped samples of the original. Compare the resulting trees:

set.seed(123)
crab1 <- crab[sample(1:nrow(crab), nrow(crab), replace = TRUE), ]
crab2 <- crab[sample(1:nrow(crab), nrow(crab), replace = TRUE), ]

ff1 <- rpart(y ~ width + color, data = crab1, method = "class")
ff2 <- rpart(y ~ width + color, data = crab2, method = "class")

par(mfrow = c(2, 1))
rpart.plot(prune(ff1, 0.11))
title("Tree 1")
rpart.plot(prune(ff2, 0.11))
title("Tree 2")

Advantages and Disadvantages of Trees

Advantages

  • Easy to explain
  • Easy to display graphically
  • Can handle categorical variables without creating dummy variables
  • Make decision-making easier for clinicians

Disadvantages

  • Generally lower predictive accuracy than other methods
  • Highly sensitive to small changes in the data

Ensemble Methods

Ensemble methods combine many simple models into a single, more powerful one. In this context, that means growing many trees and using them collectively

Bagging

The idea of bagging (bootstrap aggregating) is pretty simple: create \(B\) bootstrap samples of your (training) data, construct a tree \(\hat{f}^{*b}(x)\) on each, then average the predictions for each observation:

\[ \hat{f}_{\text{bag}}(x) = \frac1B \sum_{b=1}^B \hat{f}^{*b}(x) \] Classification, then, is determined by a majority vote across the trees.

The general idea is that averaging across a large number of bootstraps leaves the mean estimate the same while reducing variance; as such, we can use as large a value for \(B\) as we wish without fear of overfitting.

The process for constructing these trees also implies a method for testing them. On average, each bootstrap sample contains about two-thirds of the observations (the expected fraction is \(1 - 1/e \approx 0.632\)), leaving roughly one-third unsampled in each iteration. These are referred to as out-of-bag (OOB) observations. With \(B\) total bootstraps, each observation will be OOB about \(B/3\) times. Computing error rates on the OOB observations then allows us to estimate predictive accuracy without the need for cross-validation or partitioning our data into test/train sets
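Because bagging is just a random forest in which every predictor is a candidate at every split, it can be fit with randomForest by setting mtry equal to the number of predictors. A minimal sketch (not part of the original notes), using the crab data from above; printing the fit reports the OOB error estimate directly:

## Bagging as a special case of randomForest: mtry = p (here p = 2)
bag <- randomForest(y ~ width + color, data = crab, mtry = 2, ntree = 500)
bag  # the printed summary includes the OOB estimate of the error rate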

Random Forests

One problem with bagging is that the resulting trees are highly correlated. Random forests differ from bagging in that each time a split is considered, only a random subset of \(m < p\) predictors is eligible as split candidates, typically around \(m = \sqrt{p}\) (where \(p\) is the total number of predictors). This means that at any given split, most of the predictors are not even available.

The rationale for this is as follows: supposing a dataset has one particularly strong predictor, nearly every bagged tree will begin by splitting on that variable, resulting in a collection of trees that look very similar and whose predictions are highly correlated. By contrast, a random forest permits only a small subset of the predictors at each split, so on average \((p - m)/p\) of the splits will not even consider the strong predictor, and the trees end up decorrelated. We don't have to worry much about trees built on weak predictors, as the consistency of trees using strong predictors will “outvote” the poorly deciding ones once predictions are aggregated.
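For classification, randomForest defaults to mtry = floor(sqrt(p)), so the fits below just use the default; the value can also be set explicitly, as in this illustrative one-liner:

## Setting mtry explicitly; with p = 2 predictors, floor(sqrt(2)) = 1 is the default
ff_m1 <- randomForest(y ~ width + color, data = crab, mtry = 1, ntree = 200)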

library(randomForest)
library(caret)

ff <- randomForest(y ~ width + color, data = crab, ntree = 200)

## Predictions on the full (training) data
p1 <- predict(ff, crab)
## Plot of variable importance
varImpPlot(ff)

Evaluate in-sample accuracy

caret::confusionMatrix(p1, factor(crab$y))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42 14
##          1 20 97
##                                        
##                Accuracy : 0.803        
##                  95% CI : (0.736, 0.86)
##     No Information Rate : 0.642        
##     P-Value [Acc > NIR] : 0.00000259   
##                                        
##                   Kappa : 0.563        
##                                        
##  Mcnemar's Test P-Value : 0.391        
##                                        
##             Sensitivity : 0.677        
##             Specificity : 0.874        
##          Pos Pred Value : 0.750        
##          Neg Pred Value : 0.829        
##              Prevalence : 0.358        
##          Detection Rate : 0.243        
##    Detection Prevalence : 0.324        
##       Balanced Accuracy : 0.776        
##                                        
##        'Positive' Class : 0            
## 
## Compare this with logistic regression
logfit <- glm(y ~ width + color, data = crab, family = binomial)

crab$pred1 <- predict(logfit, crab, type = "response") > 0.5
crab$pred2 <- predict(logfit, crab, type = "response") > 0.4
crab$pred3 <- predict(logfit, crab, type = "response") > 0.6


with(crab, xtabs(~ y + pred1)) ## Accuracy 0.728
##    pred1
## y   FALSE TRUE
##   0    29   33
##   1    14   97
with(crab, xtabs(~ y + pred2)) ## 0.71
##    pred2
## y   FALSE TRUE
##   0    21   41
##   1     9  102
with(crab, xtabs(~ y + pred3)) ## 0.69
##    pred3
## y   FALSE TRUE
##   0    39   23
##   1    29   82
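For reference, the inline accuracy figures are just the sum of the table's diagonal over the table total; a small helper (not in the original notes) makes this explicit:

## Accuracy from a 2x2 xtabs table: correct predictions lie on the diagonal
acc <- function(tab) sum(diag(tab)) / sum(tab)
acc(with(crab, xtabs(~ y + pred1)))  # 0.728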

Evaluate predictive accuracy

## Create a 75/25 train/test split
idx <- sample(1:nrow(crab), size = floor(0.75 * nrow(crab)))
testidx <- setdiff(1:nrow(crab), idx)

ctrain <- crab[idx, ]
ctest <- crab[testidx, ]

ff_train <- randomForest(y ~ width + color, data = ctrain, ntree = 200)
logfit_train <- glm(y ~ width + color, data = ctrain, family = binomial)

## Test forest on test set
forest_pred <- predict(ff_train, ctest)
caret::confusionMatrix(forest_pred, ctest$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  8  4
##          1  8 24
##                                        
##                Accuracy : 0.727        
##                  95% CI : (0.572, 0.85)
##     No Information Rate : 0.636        
##     P-Value [Acc > NIR] : 0.136        
##                                        
##                   Kappa : 0.377        
##                                        
##  Mcnemar's Test P-Value : 0.386        
##                                        
##             Sensitivity : 0.500        
##             Specificity : 0.857        
##          Pos Pred Value : 0.667        
##          Neg Pred Value : 0.750        
##              Prevalence : 0.364        
##          Detection Rate : 0.182        
##    Detection Prevalence : 0.273        
##       Balanced Accuracy : 0.679        
##                                        
##        'Positive' Class : 0            
## 
## Test the logistic model on the test set
ctest$pred1 <- predict(logfit_train, ctest, type = "response") > 0.5
ctest$pred2 <- predict(logfit_train, ctest, type = "response") > 0.4
ctest$pred3 <- predict(logfit_train, ctest, type = "response") > 0.6


with(ctest, xtabs(~ y + pred1)) ## Accuracy 0.73
##    pred1
## y   FALSE TRUE
##   0     8    8
##   1     4   24
with(ctest, xtabs(~ y + pred2)) ## 0.75
##    pred2
## y   FALSE TRUE
##   0     7    9
##   1     2   26
with(ctest, xtabs(~ y + pred3)) ## 0.64
##    pred3
## y   FALSE TRUE
##   0     9    7
##   1     9   19

Example with a Large Dataset

Predict one of four cancer types from 2,308 gene expression measurements (the Khan dataset from ISLR2)

data(Khan)

df <- data.frame(y = factor(Khan$ytrain), Khan$xtrain)

fit <- randomForest(y ~ ., data = df)

## Which genes are most relevant?
varImpPlot(fit)
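To list the top genes numerically rather than reading them off the plot, the importance matrix can be sorted; a sketch, assuming the default MeanDecreaseGini measure reported by importance() for classification forests:

## Top 10 genes by mean decrease in Gini impurity
imp <- importance(fit)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)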

Evaluate accuracy on the training data, then on a held-out test set

## Training accuracy: perfect, as expected, since these are predictions
## on the very data the forest was fit to
p1 <- predict(fit, df)
caret::confusionMatrix(p1, df$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4
##          1  8  0  0  0
##          2  0 23  0  0
##          3  0  0 12  0
##          4  0  0  0 20
## 
## Overall Statistics
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.943, 1)         
##     No Information Rate : 0.365              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 1                  
##                                              
##  Mcnemar's Test P-Value : NA                 
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity             1.000    1.000     1.00    1.000
## Specificity             1.000    1.000     1.00    1.000
## Pos Pred Value          1.000    1.000     1.00    1.000
## Neg Pred Value          1.000    1.000     1.00    1.000
## Prevalence              0.127    0.365     0.19    0.317
## Detection Rate          0.127    0.365     0.19    0.317
## Detection Prevalence    0.127    0.365     0.19    0.317
## Balanced Accuracy       1.000    1.000     1.00    1.000
## Predictive accuracy on the held-out test set
df_test <- data.frame(y = factor(Khan$ytest), Khan$xtest)

p2 <- predict(fit, df_test)
caret::confusionMatrix(p2, df_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4
##          1 3 0 0 0
##          2 0 6 1 0
##          3 0 0 4 0
##          4 0 0 1 5
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9           
##                  95% CI : (0.683, 0.988)
##     No Information Rate : 0.3           
##     P-Value [Acc > NIR] : 0.0000000377  
##                                         
##                   Kappa : 0.864         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity              1.00    1.000    0.667    1.000
## Specificity              1.00    0.929    1.000    0.933
## Pos Pred Value           1.00    0.857    1.000    0.833
## Neg Pred Value           1.00    1.000    0.875    1.000
## Prevalence               0.15    0.300    0.300    0.250
## Detection Rate           0.15    0.300    0.200    0.250
## Detection Prevalence     0.15    0.350    0.200    0.300
## Balanced Accuracy        1.00    0.964    0.833    0.967