Support Vector Machine (SVM) on Titanic

Support Vector Machines are powerful classifiers that look for the decision boundary with the largest margin between the classes. In this lesson we use an SVM with a Radial Basis Function (RBF) kernel.
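Under the hood, caret's svmRadial model fits a kernlab SVM, and kernlab writes the RBF kernel as k(x, x') = exp(-sigma * ||x - x'||^2): larger sigma gives a narrower kernel and a more flexible boundary. A minimal sketch of evaluating that kernel by hand (illustrative values only):

library(kernlab)

# kernlab's RBF kernel: k(x, y) = exp(-sigma * ||x - y||^2)
rbf <- rbfdot(sigma = 0.1)

x <- c(1, 2)
y <- c(1.5, 2.5)

rbf(x, y)                    # kernel value computed by kernlab
exp(-0.1 * sum((x - y)^2))   # same value computed manually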

In this lesson we will:

- Import Titanic data from a local CSV
- Clean the dataset (drop columns + fix types)
- One-hot encode categorical variables (SVM needs numeric predictors)
- Create a train/test split
- Train an SVM (RBF) with cross-validation
- Evaluate accuracy + confusion matrix
- Tune sigma and C

Step 1: Import the data

library(dplyr)

path <- "raw_data/titanic_data.csv"
titanic <- read.csv(path, stringsAsFactors = FALSE)

dim(titanic)
## [1] 1309   13
head(titanic, 3)
## x pclass survived                           name    sex    age sibsp parch ticket     fare   cabin embarked                       home.dest
## 1      1        1  Allen, Miss. Elisabeth Walton female     29     0     0  24160 211.3375      B5        S                    St Louis, MO
## 2      1        1 Allison, Master. Hudson Trevor   male 0.9167     1     2 113781   151.55 C22 C26        S Montreal, PQ / Chesterville, ON
## 3      1        0   Allison, Miss. Helen Loraine female      2     1     2 113781   151.55 C22 C26        S Montreal, PQ / Chesterville, ON

Step 2: Clean and prepare data

titanic_clean <- titanic |>
  select(-any_of(c("home.dest", "cabin", "name", "X", "x", "ticket"))) |>
  filter(embarked != "?") |>
  mutate(
    pclass = factor(
      pclass,
      levels = c(1, 2, 3),
      labels = c("Upper", "Middle", "Lower")
    ),
    survived = factor(survived, levels = c(0, 1), labels = c("No", "Yes")),
    sex = factor(sex),
    embarked = factor(embarked),
    age = as.numeric(age),
    fare = as.numeric(fare),
    sibsp = as.numeric(sibsp),
    parch = as.numeric(parch)
  ) |>
  na.omit()

glimpse(titanic_clean)
## Rows: 1,043
## Columns: 8
## $ pclass   <fct> Upper, Upper, Upper, Upper, Upper, Upper, Upper, Upper, Upper…
## $ survived <fct> Yes, Yes, No, No, No, Yes, Yes, No, Yes, No, No, Yes, Yes, Ye…
## $ sex      <fct> female, male, female, male, female, male, female, male, femal…
## $ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000, …
## $ sibsp    <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ parch    <dbl> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ fare     <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77…
## $ embarked <fct> S, S, S, S, S, S, S, S, S, C, C, C, C, S, S, C, C, C, C, S, S…
table(titanic_clean$survived)
## 
##  No Yes 
## 618 425
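It is worth checking how much data the cleaning removed: the 1309 raw rows shrink to 1043, largely because of missing ages. A quick sketch, assuming the raw titanic data frame is still in memory:

# Count missing or "?" placeholder values per column in the raw data
sapply(titanic, function(col) sum(is.na(col) | col == "?"))

# Rows before vs. after cleaning
c(raw = nrow(titanic), clean = nrow(titanic_clean))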

Encoding (SVM needs numeric predictors)

We use one-hot encoding with model.matrix().

X_all <- model.matrix(survived ~ . - 1, data = titanic_clean)
y_all <- titanic_clean$survived

dim(X_all)
## [1] 1043   10
table(y_all)
## y_all
##  No Yes 
## 618 425
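If model.matrix() is new to you, a tiny toy example (hypothetical data, not from the Titanic file) shows where the 10 columns come from: with the intercept removed by - 1, the first factor contributes one indicator column per level, later factors contribute the usual k - 1 dummy columns, and numeric variables pass through unchanged. For the Titanic predictors that is 3 (pclass) + 1 (sex) + 4 numeric + 2 (embarked) = 10.

toy <- data.frame(
  pclass = factor(c("Upper", "Middle", "Lower")),
  sex    = factor(c("female", "male", "female")),
  age    = c(29, 1, 2)
)

# No intercept: 3 pclass indicators + 1 sex dummy + age = 5 columns
model.matrix(~ . - 1, data = toy)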

Step 3: Train/test split

set.seed(123)

n <- nrow(X_all)
idx_train <- sample.int(n, size = floor(0.8 * n))

X_train <- X_all[idx_train, , drop = FALSE]
y_train <- y_all[idx_train]

X_test <- X_all[-idx_train, , drop = FALSE]
y_test <- y_all[-idx_train]

c(n_train = nrow(X_train), n_test = nrow(X_test))
## n_train  n_test 
##     834     209
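The split above is a simple random sample. Because the classes are imbalanced (618 No vs 425 Yes), you could instead use a stratified split so that both sets keep similar class proportions; a sketch with caret::createDataPartition (an alternative, not what the rest of the lesson uses):

library(caret)

set.seed(123)
idx_strat <- createDataPartition(y_all, p = 0.8, list = FALSE)

# Class proportions are (approximately) preserved in both sets
prop.table(table(y_all[idx_strat]))
prop.table(table(y_all[-idx_strat]))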

Step 4: Train a baseline SVM (RBF) with cross-validation

Notes:

- SVMs are sensitive to feature scaling, so we apply centering and scaling (preProcess = c("center", "scale")).
- method = "svmRadial" in caret has two tuning parameters, sigma and C (cost). With tuneLength, sigma is estimated from the data and held constant while C is varied; Step 7 tunes both explicitly.
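Conceptually, the centering and scaling that preProcess performs looks like the following (illustration only; the train() call below handles it automatically, and X_train_scaled is not used again):

library(caret)

# Learn centering/scaling parameters on the training predictors only...
pp <- preProcess(X_train, method = c("center", "scale"))

# ...then apply the same transformation to train and test
X_train_scaled <- predict(pp, X_train)
X_test_scaled  <- predict(pp, X_test)

round(colMeans(X_train_scaled), 3)   # centered columns now have mean ~0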

# install.packages(c("caret", "kernlab", "e1071"))
library(caret)
library(kernlab)
library(e1071)

set.seed(123)

ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE
)

svm_base <- train(
  x = X_train,
  y = y_train,
  method = "svmRadial",
  metric = "Accuracy",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 6
)

svm_base
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 834 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 668, 667, 667, 667, 667 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.7925619  0.5564354
##   0.50  0.7901739  0.5498575
##   1.00  0.7901883  0.5509784
##   2.00  0.7889979  0.5490202
##   4.00  0.7842003  0.5412947
##   8.00  0.7889835  0.5521731
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1276644
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1276644 and C = 0.25.
svm_base$bestTune
##       sigma    C
## 1 0.1276644 0.25

Step 5: Predictions

pred_class <- predict(svm_base, newdata = X_test)
head(pred_class)
## [1] Yes Yes Yes Yes Yes No 
## Levels: No Yes
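Because classProbs = TRUE was set in trainControl(), the model can also return class probabilities rather than hard labels; a short sketch:

# Predicted probability of each class; columns are "No" and "Yes"
pred_prob <- predict(svm_base, newdata = X_test, type = "prob")
head(pred_prob, 3)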

Step 6: Evaluation (confusion matrix + accuracy)

confusionMatrix(pred_class, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  114  24
##        Yes  14  57
##                                          
##                Accuracy : 0.8182         
##                  95% CI : (0.7591, 0.868)
##     No Information Rate : 0.6124         
##     P-Value [Acc > NIR] : 1.041e-10      
##                                          
##                   Kappa : 0.6081         
##                                          
##  Mcnemar's Test P-Value : 0.1443         
##                                          
##             Sensitivity : 0.8906         
##             Specificity : 0.7037         
##          Pos Pred Value : 0.8261         
##          Neg Pred Value : 0.8028         
##              Prevalence : 0.6124         
##          Detection Rate : 0.5455         
##    Detection Prevalence : 0.6603         
##       Balanced Accuracy : 0.7972         
##                                          
##        'Positive' Class : No             
## 
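The reported accuracy can also be verified directly from the predictions:

# (114 + 57) correct out of 209 test passengers = 0.8182
mean(pred_class == y_test)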

Step 7: Manual tuning grid (sigma, C)

A small explicit grid search (you can expand it if needed).

grid <- expand.grid(
  sigma = c(0.001, 0.01, 0.05, 0.1),
  C     = c(0.25, 0.5, 1, 2, 4)
)

set.seed(123)
svm_tuned <- train(
  x = X_train,
  y = y_train,
  method = "svmRadial",
  metric = "Accuracy",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneGrid = grid
)

svm_tuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 834 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 668, 667, 667, 667, 667 
## Resampling results across tuning parameters:
## 
##   sigma  C     Accuracy   Kappa    
##   0.001  0.25  0.7314840  0.4668946
##   0.001  0.50  0.7314840  0.4668946
##   0.001  1.00  0.7542385  0.5045648
##   0.001  2.00  0.7770002  0.5347124
##   0.001  4.00  0.7745978  0.5284890
##   0.010  0.25  0.7734074  0.5265858
##   0.010  0.50  0.7710050  0.5197962
##   0.010  1.00  0.7746122  0.5263817
##   0.010  2.00  0.7745834  0.5250352
##   0.010  4.00  0.7793738  0.5347287
##   0.050  0.25  0.7961547  0.5677955
##   0.050  0.50  0.7949571  0.5644389
##   0.050  1.00  0.7949571  0.5624966
##   0.050  2.00  0.7937739  0.5597145
##   0.050  4.00  0.7877931  0.5453531
##   0.100  0.25  0.7961547  0.5649726
##   0.100  0.50  0.7937667  0.5591207
##   0.100  1.00  0.7901883  0.5503677
##   0.100  2.00  0.7913859  0.5532221
##   0.100  4.00  0.7901883  0.5525087
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 0.25.
svm_tuned$bestTune
##    sigma    C
## 16   0.1 0.25
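To see how cross-validated accuracy varies over the grid, caret's plot method for train objects is handy; a sketch:

# Cross-validated accuracy across the (sigma, C) grid
plot(svm_tuned)

# Or inspect the resampling summary table directly
svm_tuned$results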

Evaluate tuned model:

pred_class2 <- predict(svm_tuned, newdata = X_test)
confusionMatrix(pred_class2, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  114  24
##        Yes  14  57
##                                          
##                Accuracy : 0.8182         
##                  95% CI : (0.7591, 0.868)
##     No Information Rate : 0.6124         
##     P-Value [Acc > NIR] : 1.041e-10      
##                                          
##                   Kappa : 0.6081         
##                                          
##  Mcnemar's Test P-Value : 0.1443         
##                                          
##             Sensitivity : 0.8906         
##             Specificity : 0.7037         
##          Pos Pred Value : 0.8261         
##          Neg Pred Value : 0.8028         
##              Prevalence : 0.6124         
##          Detection Rate : 0.5455         
##    Detection Prevalence : 0.6603         
##       Balanced Accuracy : 0.7972         
##                                          
##        'Positive' Class : No             
## 
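The tuned model gives exactly the same test-set confusion matrix as the baseline. For a finer comparison you can pool the cross-validation results of both fits with caret::resamples(); a sketch (both models used 5-fold CV):

cv_comp <- resamples(list(baseline = svm_base, tuned = svm_tuned))
summary(cv_comp)   # fold-by-fold Accuracy and Kappa for both models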

Summary

You learned how to:

- clean Titanic data and create numeric predictors via one-hot encoding
- train an SVM with an RBF kernel using cross-validation
- scale features (center/scale)
- tune sigma and C and evaluate with a confusion matrix

 

A work by Gianluca Sottile

gianluca.sottile@unipa.it