Support Vector Machines (SVMs) are powerful classifiers that look for the decision boundary with the maximum margin between classes. In this lesson we use an SVM with a Radial Basis Function (RBF) kernel.
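The RBF kernel scores the similarity of two observations as a Gaussian function of their distance; in `kernlab`'s parameterization, `k(x, y) = exp(-sigma * ||x - y||^2)`. A quick illustration on toy points (not the lesson data):

```r
library(kernlab)

# kernlab's RBF kernel: k(x, y) = exp(-sigma * ||x - y||^2)
rbf <- rbfdot(sigma = 0.1)
rbf(c(1, 2), c(1.5, 2.5))   # nearby points -> value close to 1
rbf(c(1, 2), c(10, 20))     # distant points -> value close to 0
```

Small `sigma` makes the kernel decay slowly (smoother boundaries); large `sigma` makes it decay quickly (wigglier boundaries). This is the `sigma` that `caret` tunes below.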
In this lesson we will:

- Import Titanic data from a local CSV
- Clean the dataset (drop columns + fix types)
- One-hot encode categorical variables (SVM needs numeric predictors)
- Create a train/test split
- Train an SVM (RBF) with cross-validation
- Evaluate accuracy + confusion matrix
- Tune `sigma` and `C`
```r
library(dplyr)

path <- "raw_data/titanic_data.csv"
titanic <- read.csv(path, stringsAsFactors = FALSE)

dim(titanic)
```

```
## [1] 1309 13
```
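The first rows of the raw data are shown in the table below; they were presumably printed with something like:

```r
head(titanic, 3)
```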
| x | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.3375 | B5 | S | St Louis, MO |
| 2 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | Montreal, PQ / Chesterville, ON |
```r
titanic_clean <- titanic |>
  # drop identifiers and mostly-missing columns
  select(-any_of(c("home.dest", "cabin", "name", "X", "x", "ticket"))) |>
  # drop rows with an unknown port of embarkation
  filter(embarked != "?") |>
  mutate(
    pclass = factor(
      pclass,
      levels = c(1, 2, 3),
      labels = c("Upper", "Middle", "Lower")
    ),
    survived = factor(survived, levels = c(0, 1), labels = c("No", "Yes")),
    sex = factor(sex),
    embarked = factor(embarked),
    age = as.numeric(age),
    fare = as.numeric(fare),
    sibsp = as.numeric(sibsp),
    parch = as.numeric(parch)
  ) |>
  na.omit()

glimpse(titanic_clean)
```

```
## Rows: 1,043
## Columns: 8
## $ pclass   <fct> Upper, Upper, Upper, Upper, Upper, Upper, Upper, Upper, Upper…
## $ survived <fct> Yes, Yes, No, No, No, Yes, Yes, No, Yes, No, No, Yes, Yes, Ye…
## $ sex      <fct> female, male, female, male, female, male, female, male, femal…
## $ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000, …
## $ sibsp    <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1…
## $ parch    <dbl> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1…
## $ fare     <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77…
## $ embarked <fct> S, S, S, S, S, S, S, S, S, C, C, C, C, S, S, C, C, C, C, S, S…
```

The class balance of the outcome:

```r
table(titanic_clean$survived)
```

```
## 
##  No Yes 
## 618 425
```
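The step that builds the numeric design matrix `X_all` and the outcome vector `y_all` is not shown above. A minimal sketch, assuming `model.matrix()` was used for the one-hot encoding (the names match the split code that follows, and dropping the intercept reproduces the 10 predictors caret reports below):

```r
# Assumed reconstruction: the original encoding step was not shown.
# model.matrix() expands factors into dummy columns; with ~ . - 1 the
# intercept is dropped, so all three pclass levels are kept, giving
# 10 numeric predictor columns in total.
X_all <- model.matrix(survived ~ . - 1, data = titanic_clean)
y_all <- titanic_clean$survived

dim(X_all)
```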
```r
set.seed(123)

n <- nrow(X_all)
idx_train <- sample.int(n, size = floor(0.8 * n))

X_train <- X_all[idx_train, , drop = FALSE]
y_train <- y_all[idx_train]
X_test  <- X_all[-idx_train, , drop = FALSE]
y_test  <- y_all[-idx_train]

c(n_train = nrow(X_train), n_test = nrow(X_test))
```

```
## n_train  n_test 
##     834     209
```
Notes:

- SVM is sensitive to feature scaling, so we apply `center` and `scale` preprocessing.
- `svmRadial` in `caret` tunes `sigma` and `C` (cost).
```r
# install.packages(c("caret", "kernlab", "e1071"))
library(caret)
library(kernlab)
library(e1071)

set.seed(123)

# 5-fold cross-validation, keeping class probabilities
ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE
)

svm_base <- train(
  x = X_train,
  y = y_train,
  method = "svmRadial",
  metric = "Accuracy",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 6
)

svm_base
```

```
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 834 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 668, 667, 667, 667, 667 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.7925619  0.5564354
##   0.50  0.7901739  0.5498575
##   1.00  0.7901883  0.5509784
##   2.00  0.7889979  0.5490202
##   4.00  0.7842003  0.5412947
##   8.00  0.7889835  0.5521731
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1276644
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1276644 and C = 0.25.
```
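The selected hyperparameters (shown in the table below) can be read directly off the fitted object:

```r
# best cross-validated hyperparameter combination chosen by caret
svm_base$bestTune
```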
| sigma | C |
|---|---|
| 0.1276644 | 0.25 |
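The prediction code was not shown in the original; a sketch that produces the held-out predictions previewed below:

```r
# predict classes on the held-out test set
pred_base <- predict(svm_base, newdata = X_test)
head(pred_base)
```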
```
## [1] Yes Yes Yes Yes Yes No 
## Levels: No Yes
```
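The statistics below presumably come from caret's `confusionMatrix()`, which by default treats the first factor level ("No") as the positive class:

```r
# compare predictions against the true test labels
confusionMatrix(pred_base, y_test)
```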
```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  114  24
##        Yes  14  57
## 
##                Accuracy : 0.8182
##                  95% CI : (0.7591, 0.868)
##     No Information Rate : 0.6124
##     P-Value [Acc > NIR] : 1.041e-10
## 
##                   Kappa : 0.6081
## 
##  Mcnemar's Test P-Value : 0.1443
## 
##             Sensitivity : 0.8906
##             Specificity : 0.7037
##          Pos Pred Value : 0.8261
##          Neg Pred Value : 0.8028
##              Prevalence : 0.6124
##          Detection Rate : 0.5455
##    Detection Prevalence : 0.6603
##       Balanced Accuracy : 0.7972
## 
##        'Positive' Class : No
```
A small explicit grid search (you can expand it if needed).
```r
# explicit tuning grid over sigma (kernel width) and C (cost)
grid <- expand.grid(
  sigma = c(0.001, 0.01, 0.05, 0.1),
  C = c(0.25, 0.5, 1, 2, 4)
)

set.seed(123)

svm_tuned <- train(
  x = X_train,
  y = y_train,
  method = "svmRadial",
  metric = "Accuracy",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneGrid = grid
)

svm_tuned
```

```
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 834 samples
##  10 predictor
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 668, 667, 667, 667, 667 
## Resampling results across tuning parameters:
## 
##   sigma  C     Accuracy   Kappa    
##   0.001  0.25  0.7314840  0.4668946
##   0.001  0.50  0.7314840  0.4668946
##   0.001  1.00  0.7542385  0.5045648
##   0.001  2.00  0.7770002  0.5347124
##   0.001  4.00  0.7745978  0.5284890
##   0.010  0.25  0.7734074  0.5265858
##   0.010  0.50  0.7710050  0.5197962
##   0.010  1.00  0.7746122  0.5263817
##   0.010  2.00  0.7745834  0.5250352
##   0.010  4.00  0.7793738  0.5347287
##   0.050  0.25  0.7961547  0.5677955
##   0.050  0.50  0.7949571  0.5644389
##   0.050  1.00  0.7949571  0.5624966
##   0.050  2.00  0.7937739  0.5597145
##   0.050  4.00  0.7877931  0.5453531
##   0.100  0.25  0.7961547  0.5649726
##   0.100  0.50  0.7937667  0.5591207
##   0.100  1.00  0.7901883  0.5503677
##   0.100  2.00  0.7913859  0.5532221
##   0.100  4.00  0.7901883  0.5525087
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 0.25.
```
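As before, the winning combination (table below) is stored on the fitted object:

```r
svm_tuned$bestTune
```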
| sigma | C    |
|-------|------|
| 0.1   | 0.25 |
Evaluate tuned model:
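The original evaluation code is not shown; a sketch mirroring the base-model evaluation (here the tuned model happens to give the same test-set results):

```r
pred_tuned <- predict(svm_tuned, newdata = X_test)
confusionMatrix(pred_tuned, y_test)
```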
```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  114  24
##        Yes  14  57
## 
##                Accuracy : 0.8182
##                  95% CI : (0.7591, 0.868)
##     No Information Rate : 0.6124
##     P-Value [Acc > NIR] : 1.041e-10
## 
##                   Kappa : 0.6081
## 
##  Mcnemar's Test P-Value : 0.1443
## 
##             Sensitivity : 0.8906
##             Specificity : 0.7037
##          Pos Pred Value : 0.8261
##          Neg Pred Value : 0.8028
##              Prevalence : 0.6124
##          Detection Rate : 0.5455
##    Detection Prevalence : 0.6603
##       Balanced Accuracy : 0.7972
## 
##        'Positive' Class : No
```
You learned how to:

- clean Titanic data and create numeric predictors via one-hot encoding
- train an SVM with an RBF kernel using cross-validation
- scale features (center/scale)
- tune `sigma` and `C` and evaluate with a confusion matrix
A work by Gianluca Sottile
gianluca.sottile@unipa.it