Titanic Trees
Work through the Kaggle Titanic competition found here with bagging, random forests, and gradient boosting (GBM).
Libraries I might Use
library(tidyverse)
library(class)
library(ISLR)
library(randomForest)
library(gbm)
library(MASS)
library(rpart)
library(caTools)
Loading in the Data
train_data = read.csv("train.csv") # The train data is really my full labeled sample
test_data = read.csv("test.csv") # New predictions will be performed on this data set
Encoding Factors and forcing NA for char columns
train_data[train_data==""] <- NA
train_data$Sex=factor(train_data$Sex, levels=c("male","female") ,labels = c(0,1))
train_data$Pclass=factor(train_data$Pclass, levels = c(1,2,3), labels = c(0,1,2))
train_data$Embarked=factor(train_data$Embarked, levels = c("C","Q","S"), labels = c(0,1,2))
train_data$Survived=factor(train_data$Survived)
train_data
Exploring the data
train_data
summary(train_data)
PassengerId: Min. 1.0, 1st Qu. 223.5, Median 446.0, Mean 446.0, 3rd Qu. 668.5, Max. 891.0
Survived: 0: 549, 1: 342
Pclass: 0: 216, 1: 184, 2: 491
Name: character (891)
Sex: 0: 577, 1: 314
Age: Min. 0.42, 1st Qu. 20.12, Median 28.00, Mean 29.70, 3rd Qu. 38.00, Max. 80.00, NA's: 177
SibSp: Min. 0.000, 1st Qu. 0.000, Median 0.000, Mean 0.523, 3rd Qu. 1.000, Max. 8.000
Parch: Min. 0.0000, 1st Qu. 0.0000, Median 0.0000, Mean 0.3816, 3rd Qu. 0.0000, Max. 6.0000
Ticket: character (891)
Fare: Min. 0.00, 1st Qu. 7.91, Median 14.45, Mean 32.20, 3rd Qu. 31.00, Max. 512.33
Cabin: character (891)
Embarked: 0: 168, 1: 77, 2: 644, NA's: 2
I’m predicting Survived.
Intuitively, I’d expect Pclass to be a significant predictor.
Likewise, don’t women and children always go first? So Age and Sex should be valuable predictors.
SibSp is the number of siblings/spouses aboard. Maybe important? People traveling with you might help you escape, but you might also stay behind for them.
Parch is the number of parents/children aboard. If you are young, I’d expect more parents aboard to increase your chances, while if you are older, having children aboard might cause you to hold back trying to help them.
The information in Fare is probably already contained in Pclass (a quick check is sketched right after this list).
Cabin and Ticket seem unlikely to provide much meaningful information.
Embarked is the port of embarkation. Unlikely to help much, but I’m not that confident about that.
All told, the big factors are Pclass, Age, Sex, SibSp, and Parch, and maybe Embarked.
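As a quick check on the claim that Fare is mostly redundant with Pclass, a boxplot of fare within each class (a small sketch, not part of the original run) should show whether fare carries much information beyond class.
ggplot(train_data) +
  geom_boxplot(aes(x = Pclass, y = Fare, fill = Pclass))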
ggplot(train_data)+
geom_violin(aes(x=Survived, y = Age, fill = Survived))
Right away we are reminded of the 177 missing Age values seen in the summary.
We can also see that the did-not-survive group is weighted more heavily around age 25.
ggplot(train_data[which(is.na(train_data$Age)),])+
geom_bar(aes(x=Survived, fill = Survived))
Most of the passengers with missing Age values did not survive.
train_data[which(is.na(train_data$Age)),]
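To put a number on that, a quick tabulation of Survived among the missing-Age rows (a small check, not part of the original run):
table(train_data$Survived[is.na(train_data$Age)])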
ggplot(train_data)+
geom_bar(aes(x=Pclass, fill = Survived))
Ticket class breakdown is as you would expect.
ggplot(train_data)+
geom_bar(aes(x=Embarked, fill = Survived))
There is more information here than you would expect: Southampton (level 2) had by far the most passengers, but it also has the worst survival percentage.
ggplot(train_data)+
geom_bar(aes(x=SibSp, fill = Survived))
Most people had 0 siblings or spouses aboard. The better survival percentage at 1 could be attributed directly to spouses, but that explanation doesn’t hold up for the better percentage at 2.
ggplot(train_data)+
geom_bar(aes(x=Parch, fill = Survived))
Survival is higher at 1, 2, and 3. The bumps at 1 and 2 are most likely driven by parents; 3 is interesting.
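To quantify the last two plots, row-wise proportion tables (a quick check, not part of the original run) give the survival rate at each SibSp and Parch count:
prop.table(table(train_data$SibSp, train_data$Survived), margin = 1)
prop.table(table(train_data$Parch, train_data$Survived), margin = 1)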
ggplot(train_data)+
geom_bar(aes(x=Sex, fill = Survived))
Women survived at a MUCH higher rate.
train_data$agegroup = ifelse(train_data$Age < 16, 0, ifelse(train_data$Age < 50, 1, 2))
ggplot(train_data)+
geom_bar(aes(x=agegroup, fill = Survived))
Children survive, those > 50 hardly do.
Dealing with Missing Values
Now that we have explored the data, let’s deal with the missing values in Embarked and Age. We will use Survived, Pclass, Sex, Age, SibSp, Parch, and Embarked to predict whichever one is missing.
clean_data = train_data[,c(2,3,5,6,7,8, 12)]
clean_data
Embarked
clean_data_embarked = clean_data[-which(is.na(train_data$Embarked)),]
clean_data_embarked
Test/Train Split
set.seed(1234)
split = sample.split(clean_data_embarked$Survived, SplitRatio = 0.8)
train = subset(clean_data_embarked, split == TRUE)
test = subset(clean_data_embarked, split == FALSE)
Model
set.seed(1234)
rf.Embarked = randomForest(Embarked~Survived + Pclass+Sex+SibSp+Parch, data = clean_data_embarked,importance =TRUE, mtry = 3)
importance(rf.Embarked)
0 1 2 MeanDecreaseAccuracy MeanDecreaseGini
Survived 0.1761412 23.926388 -0.7525674 10.47326 9.088543
Pclass 29.7102029 20.917772 17.0280234 33.91091 28.531706
Sex 1.1353457 25.138247 3.6891944 14.49255 9.606021
SibSp 5.0916546 3.144096 9.8194733 12.15916 18.989624
Parch 4.9348200 19.201793 16.3263343 21.33507 18.814807
yhat.Embarked = predict(rf.Embarked,newdata=test, type="response")
confusion_matrix = table(test$Embarked,yhat.Embarked)
confusion_matrix
yhat.Embarked
0 1 2
0 15 0 23
1 0 7 9
2 6 1 117
sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.7808989
Not great predictive power, but we are only replacing two values, and level 2 is the most likely anyway. I could do more here by bringing in other predictors that probably ARE informative about the port, like Fare and Ticket (a sketch of that is below), but for the sake of time I’m just going to impute Embarked = 2.
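For reference, a sketch of that fuller imputation model, folding Fare into the predictors (hypothetical and not run here; embarked_vars and rf.Embarked.fare are just illustrative names):
# Sketch only: add Fare to the Embarked imputation model and check its importance.
embarked_vars = train_data[!is.na(train_data$Embarked),
                           c("Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked")]
rf.Embarked.fare = randomForest(Embarked ~ ., data = embarked_vars, importance = TRUE, mtry = 3)
importance(rf.Embarked.fare)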
train_data$Embarked[which(is.na(train_data$Embarked))] = 2
clean_data$Embarked[which(is.na(clean_data$Embarked))] = 2
Age
clean_data_age = clean_data[-which(is.na(train_data$Age)),]
clean_data_age = clean_data_age[,-1] # Can't use Survived here because the real test set doesn't include it.
clean_data_age
Test/Train Split
set.seed(1234)
split = sample.split(clean_data_age$Age, SplitRatio = 0.8)
train = subset(clean_data_age, split == TRUE)
test = subset(clean_data_age, split == FALSE)
Model
set.seed(1234)
rf.Age = randomForest(Age~., data = clean_data_age,importance =TRUE, mtry = 4)
importance(rf.Age)
%IncMSE IncNodePurity
Pclass 87.248524 21086.175
Sex 16.198176 5410.003
SibSp 34.869585 15898.640
Parch 53.452846 21110.849
Embarked 9.446469 5809.567
yhat.Age = predict(rf.Age,newdata=test)
mean((yhat.Age-test$Age)^2)
[1] 110.6331
Ouch, that is not a great mean squared error, but let’s keep rolling with it for now.
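For context (a small comparison, not part of the original run), the MSE from simply imputing the median training-set age would be:
# Baseline: squared error if every missing age were filled with the training median.
median_age = median(train$Age)
mean((median_age - test$Age)^2)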
clean_data$Age[which(is.na(clean_data$Age))] = predict(rf.Age, newdata = clean_data[which(is.na(clean_data$Age)),])
train_data$Age[which(is.na(train_data$Age))] = predict(rf.Age, newdata = train_data[which(is.na(train_data$Age)),])
Model Building
clean_data
set.seed(1234)
split = sample.split(clean_data$Survived, SplitRatio = 0.8)
train = subset(clean_data, split == TRUE)
test = subset(clean_data, split == FALSE)
rf.Survived = randomForest(Survived~., data = train,importance =TRUE)
yhat.Survived = predict(rf.Survived,newdata=test, type="response")
confusion_matrix = table(test$Survived,yhat.Survived)
confusion_matrix
yhat.Survived
0 1
0 92 18
1 20 48
140/178
[1] 0.7865169
Not bad, but how does it compare to a gender-only guess?
yhat.gender = ifelse(test$Sex == 1, 1, 0)
confusion_matrix = table(test$Survived,yhat.gender)
confusion_matrix
yhat.gender
0 1
0 89 21
1 21 47
136/178
[1] 0.7640449
The random forest is better than the gender-only guess, but not THAT much better. Let’s go ahead and clean up the test data and submit.
Test Data Cleanup
test_data
test_data$Sex=factor(test_data$Sex, levels=c("male","female") ,labels = c(0,1))
test_data$Pclass=factor(test_data$Pclass, levels = c(1,2,3), labels = c(0,1,2))
test_data$Embarked=factor(test_data$Embarked, levels = c("C","Q","S"), labels = c(0,1,2))
test_data_clean = test_data[,c(2,4,5,6,7,11)]
test_data_clean
summary(test_data_clean)
Pclass: 0: 107, 1: 93, 2: 218
Sex: 0: 266, 1: 152
Age: Min. 0.17, 1st Qu. 21.00, Median 27.00, Mean 30.27, 3rd Qu. 39.00, Max. 76.00, NA's: 86
SibSp: Min. 0.0000, 1st Qu. 0.0000, Median 0.0000, Mean 0.4474, 3rd Qu. 1.0000, Max. 8.0000
Parch: Min. 0.0000, 1st Qu. 0.0000, Median 0.0000, Mean 0.3923, 3rd Qu. 0.0000, Max. 9.0000
Embarked: 0: 102, 1: 46, 2: 270
Only missing values in Age
test_data_clean$Age[which(is.na(test_data_clean$Age))]=predict(rf.Age,test_data_clean[which(is.na(test_data_clean$Age)),])
test_data_clean
Rfguess = data.frame(test_data$PassengerId, predict(rf.Survived,test_data_clean))
colnames(Rfguess) <- c("PassengerId","Survived")
Rfguess
write.csv(Rfguess, "rfguess.csv",row.names = FALSE)
Final Kaggle score: 0.77751. Not great, but it puts me in the top quarter and improves on the gender-only guess at least!
bagged.Survived = randomForest(Survived~., data = train,importance =TRUE, mtry = 6)
yhat.Survived = predict(bagged.Survived,newdata=test, type="response")
confusion_matrix = table(test$Survived,yhat.Survived)
confusion_matrix
yhat.Survived
0 1
0 87 23
1 19 49
sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.7640449
Somewhat worse than the plain random forest.
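Rather than comparing hand-picked mtry values on a single split, the OOB error can choose it; a sketch using tuneRF from randomForest (not part of the original run):
# Sketch: search over mtry using the out-of-bag error; column 1 of train is Survived.
set.seed(1234)
tuneRF(train[, -1], train$Survived, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)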
Baggedguess = data.frame(test_data$PassengerId, predict(bagged.Survived,test_data_clean))
colnames(Baggedguess) <- c("PassengerId","Survived")
write.csv(Baggedguess, "baggedguess.csv", row.names = FALSE)
boost.Survived = gbm(as.character(Survived)~., data = train, distribution = "bernoulli", n.trees = 5000)
best.iter <- gbm.perf(boost.Survived, method = "OOB", plot = FALSE)
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
best.iter
[1] 92
attr(,"smoother")
Call:
loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
length(x)/10), 50))
Number of Observations: 5000
Equivalent Number of Parameters: 39.99
Residual Standard Error: 0.000677
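Following the hint in the gbm message, a cross-validated refit (a sketch, not part of the original run; boost.cv and best.iter.cv are illustrative names) would pick the iteration count from the CV error instead of the OOB estimate:
# Sketch: 5-fold CV fit; gbm.perf with method = "cv" chooses the tree count from the CV error.
set.seed(1234)
boost.cv = gbm(as.character(Survived) ~ ., data = train, distribution = "bernoulli",
               n.trees = 5000, cv.folds = 5)
best.iter.cv = gbm.perf(boost.cv, method = "cv", plot.it = FALSE)
best.iter.cv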
prob = predict(boost.Survived, newdata=test, best.iter,type="response")
yhat.Survived = ifelse(prob>.5, 1, 0)
confusion_matrix = table(test$Survived,yhat.Survived)
confusion_matrix
yhat.Survived
0 1
0 92 18
1 16 52
sum(diag(confusion_matrix))/sum(confusion_matrix)
[1] 0.8089888
A bit better than the random forest on this split.
prob = predict(boost.Survived, newdata = test_data_clean, n.trees = best.iter, type = "response")
boostedguess = data.frame(test_data$PassengerId, ifelse(prob > .5, 1, 0))
colnames(boostedguess) <- c("PassengerId","Survived")
write.csv(boostedguess, "boguess.csv",row.names = FALSE)