Deep neural networks approximate complex functions of the form
\(f_\theta : \mathbb{R}^p \to [0,1]\) (for
binary classification) by composing affine
transformations and non-linear activation
functions.
A feedforward network with \(L\) layers has the recursive form \[ h^{(0)} = x, \quad h^{(\ell)} = \sigma\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big),\ \ell = 1,\dots,L-1,\quad \hat{y} = \text{sigmoid}\big(w^{(L)} h^{(L-1)} + b^{(L)}\big) \] where \(W^{(\ell)}\) and \(b^{(\ell)}\) are the weight matrix and bias vector of layer \(\ell\), \(\sigma\) is an element-wise non-linear activation (here, ReLU), and \(w^{(L)}, b^{(L)}\) are the weights and bias of the output unit.
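To make the recursion concrete, here is a tiny forward pass in base R; the input and weights are invented for illustration only.

```r
# Toy forward pass: one ReLU hidden layer, then a sigmoid output unit
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(z, 0)

x  <- c(0.5, -1.2)                              # input (p = 2)
W1 <- matrix(c(0.2, -0.4, 0.7, 0.1), nrow = 2)  # hidden-layer weights
b1 <- c(0.1, 0)                                 # hidden-layer biases
w2 <- c(0.3, -0.5); b2 <- 0.2                   # output weights and bias

h1    <- relu(as.vector(W1 %*% x) + b1)  # h^(1): here both units are clipped to 0
y_hat <- sigmoid(sum(w2 * h1) + b2)      # predicted probability
y_hat  # ~0.55
```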
Training minimizes a loss function plus an optional
regularization term, e.g.
binary cross-entropy with \(L_2\)
penalty: \[
\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n \big[ y_i \log \hat{y}_i
+ (1-y_i)\log(1-\hat{y}_i) \big]
+ \lambda \sum_{\ell} \| W^{(\ell)} \|_2^2
\] optimized by variants of stochastic gradient
descent (Adam, RMSprop, etc.).
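As a sanity check on the formula, the loss can be evaluated by hand for a few toy predictions (labels, probabilities, and the weight matrix below are invented for illustration):

```r
# Binary cross-entropy plus L2 penalty, computed directly from the formula
y    <- c(1, 0, 1, 1)           # true labels
yhat <- c(0.9, 0.2, 0.6, 0.8)   # predicted probabilities
bce  <- -mean(y * log(yhat) + (1 - y) * log(1 - yhat))

lambda <- 1e-4
W      <- list(matrix(c(0.5, -0.3), nrow = 1))  # toy weight matrices
l2     <- lambda * sum(vapply(W, function(w) sum(w^2), numeric(1)))

bce + l2  # ~0.266
```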
In this lesson we work on a tabular, business-style dataset instead of standard image benchmarks, to align with typical applied use cases.
We use a subset of the UCI Bank Marketing data (term-deposit
subscription). Suppose you have saved a preprocessed version as
raw_data/bank_marketing.csv (you can adapt the path/URL as
needed).
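The loading code is not shown here; a plausible sketch is the following (the semicolon-separated format and the use of read_csv2() are assumptions, suggested by the decimal-mark message in the output below):

```r
library(tidyverse)

# read_csv2() assumes ';' field separators and ',' decimal marks,
# as in the file distributed by the UCI repository
bank <- read_csv2("raw_data/bank_marketing.csv")
glimpse(bank)
```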
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 4,521
## Columns: 17
## $ age <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, 40, 56, 37, 25, 31, 38, 42, 44, 4…
## $ job <chr> "unemployed", "services", "management", "management", "blue-collar", "management", "self-empl…
## $ marital <chr> "married", "married", "single", "married", "married", "single", "married", "married", "marrie…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary", "tertiary", "tertiary", "seconda…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ balance <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 264, 1109, 502, 360, 194, 4073, 231…
## $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no",…
## $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "yes",…
## $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "cellular", "cellular", "cellular",…
## $ day <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29, 27, 20, 23, 7, 18, 19, 12, 7, 30…
## $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may", "apr", "may", "apr", "aug", "a…
## $ duration <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 328, 261, 89, 189, 239, 114, 250, 1…
## $ campaign <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, 2, 3, 2, 2, 3, 2, 1, 1, 2, 2, 2, …
## $ pdays <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1, 241, -1, -1, 152, -1, 152, -1, -…
## $ previous <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, …
## $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "failure", "other", "unknown", "unknow…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "yes", "no", "n…
Assume that y is the target (a factor with levels "no" and "yes", or 0/1). We first recode y as a factor with "no" as the reference level:
bank <- bank |>
  mutate(
    y = factor(y, levels = c("no", "yes"))
  )
# Example: keep some core predictors (adjust to your actual columns)
bank <- bank |>
select(
y, age, balance, duration, campaign, pdays, previous,
job, marital, education, default, housing, loan, contact, poutcome
)
summary(bank$y)
##   no  yes 
## 4000  521
We perform a random 80/10/10 train/validation/test split. Note that the simple split below does not stratify on the target; we check afterwards that the class proportions remain comparable across the three sets.
set.seed(123)
# Create indices
n <- nrow(bank)
idx <- sample.int(n)
bank_shuffled <- bank[idx, ]
# Simple 80/10/10 split
n_train <- floor(0.8 * n)
n_valid <- floor(0.1 * n)
train <- bank_shuffled[1:n_train, ]
valid <- bank_shuffled[(n_train + 1):(n_train + n_valid), ]
test <- bank_shuffled[(n_train + n_valid + 1):n, ]
prop.table(table(train$y))
## 
##        no       yes 
## 0.8855088 0.1144912
prop.table(table(valid$y))
## 
##         no        yes 
## 0.90265487 0.09734513
prop.table(table(test$y))
## 
##        no       yes 
## 0.8609272 0.1390728
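If exact stratification is needed, one option is to sample within each class. The stratified_assign() helper below is a sketch, not part of the original pipeline:

```r
# Assign each row to train/valid/test, sampling within each level of y
# so that class proportions match (up to rounding) in all three sets
stratified_assign <- function(y, p_train = 0.8, p_valid = 0.1) {
  grp <- character(length(y))
  for (idx in split(seq_along(y), y)) {
    idx  <- sample(idx)            # shuffle within the class
    n    <- length(idx)
    n_tr <- floor(p_train * n)
    n_va <- floor(p_valid * n)
    grp[idx] <- rep(c("train", "valid", "test"),
                    times = c(n_tr, n_va, n - n_tr - n_va))
  }
  grp
}

# usage: split_of <- stratified_assign(bank$y)
#        train    <- bank[split_of == "train", ]
```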
For tabular data with mixed types (numeric + categorical) we need to:
We build a simple preprocessing pipeline using base R; in more
advanced setups you could use keras preprocessing layers or
feature_spec utilities.
num_vars <- train |>
select(where(is.numeric)) |>
names()
cat_vars <- train |>
select(where(is.character) | where(is.factor)) |>
select(-y) |>
names()
num_vars
## [1] "age" "balance" "duration" "campaign" "pdays" "previous"
cat_vars
## [1] "job" "marital" "education" "default" "housing" "loan" "contact" "poutcome"
Standardize numeric features using training means and sds:
scale_numeric <- function(df, num_vars, center, scale) {
df |>
mutate(across(all_of(num_vars), ~ (.x - center[cur_column()]) / scale[cur_column()]))
}
num_means <- sapply(train[, num_vars, drop = FALSE], mean, na.rm = TRUE)
num_sds <- sapply(train[, num_vars, drop = FALSE], sd, na.rm = TRUE)
train_scaled <- scale_numeric(train, num_vars, num_means, num_sds)
valid_scaled <- scale_numeric(valid, num_vars, num_means, num_sds)
test_scaled <- scale_numeric(test, num_vars, num_means, num_sds)
One-hot encode categorical predictors using model.matrix():
# Fix the categorical levels from the training data, so that the
# train/valid/test design matrices always share the same columns
# (a level absent from valid/test would otherwise change the encoding)
cat_levels <- lapply(train[cat_vars], function(x) sort(unique(as.character(x))))
make_design_matrix <- function(df) {
  mm <- model.matrix(
    ~ . - 1,
    data = df |>
      select(-y) |>
      mutate(across(all_of(cat_vars),
                    ~ factor(as.character(.x), levels = cat_levels[[cur_column()]])))
  )
  as.matrix(mm)
}
X_train <- make_design_matrix(train_scaled)
X_valid <- make_design_matrix(valid_scaled)
X_test <- make_design_matrix(test_scaled)
# Response as 0/1
y_train <- as.numeric(train_scaled$y == "yes")
y_valid <- as.numeric(valid_scaled$y == "yes")
y_test <- as.numeric(test_scaled$y == "yes")
dim(X_train)
## [1] 3616 31
length(y_train)
## [1] 3616
We now design a fully connected DNN for binary classification.
Key design components:
- Dense hidden layers with ReLU activations.
- Dropout and \(L_2\) weight regularization to limit overfitting.
- An EarlyStopping callback with patience, to stop training when validation performance stops improving.

Mathematically, each hidden layer is \[ h^{(\ell)} = \max(0, W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}), \] and dropout randomly sets a fraction \(p\) of units to zero during training, approximating an ensemble of thinned networks and acting as a regularizer.
input_dim <- ncol(X_train)
input_dim
## [1] 31
Define the model:
library(keras3)
build_dnn_model <- function(input_dim, l2_lambda = 1e-4, dropout_rate = 0.3) {
input <- layer_input(shape = input_dim, name = "features")
x <- input |>
layer_dense(units = 128, activation = "relu",
kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
layer_dropout(rate = dropout_rate) |>
layer_dense(units = 64, activation = "relu",
kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
layer_dropout(rate = dropout_rate) |>
layer_dense(units = 32, activation = "relu",
kernel_regularizer = regularizer_l2(l = l2_lambda))
output <- x |>
layer_dense(units = 1, activation = "sigmoid", name = "output")
model <- keras_model(inputs = input, outputs = output)
model |>
compile(
optimizer = optimizer_adam(learning_rate = 1e-3),
loss = "binary_crossentropy",
metrics = list(
"accuracy",
metric_precision(name = "precision"),
metric_recall(name = "recall")
)
)
model
}
model <- build_dnn_model(input_dim)
model
## Model: "functional_11"
## ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
## ┃ Layer (type) ┃ Output Shape ┃ Param # ┃
## ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
## │ features (InputLayer) │ (None, 31) │ 0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_17 (Dense) │ (None, 128) │ 4,096 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_8 (Dropout) │ (None, 128) │ 0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_18 (Dense) │ (None, 64) │ 8,256 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_9 (Dropout) │ (None, 64) │ 0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_19 (Dense) │ (None, 32) │ 2,080 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ output (Dense) │ (None, 1) │ 33 │
## └─────────────────────────────────────────────────┴──────────────────────────────────────┴──────────────────────┘
## Total params: 14,465 (56.50 KB)
## Trainable params: 14,465 (56.50 KB)
## Non-trainable params: 0 (0.00 B)
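The parameter counts in the summary can be verified by hand: a dense layer with n_in inputs and n_out units has n_in × n_out weights plus n_out biases.

```r
# Parameters of a dense layer: weights plus biases
dense_params <- function(n_in, n_out) n_in * n_out + n_out

layer_params <- c(dense_params(31, 128),   # 4096
                  dense_params(128, 64),   # 8256
                  dense_params(64, 32),    # 2080
                  dense_params(32, 1))     # 33
sum(layer_params)  # 14465, matching "Total params" above
```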
callback_es <- callback_early_stopping(
monitor = "val_loss",
patience = 10,
restore_best_weights = TRUE
)
set.seed(123)
history <- model |>
fit(
x = X_train, y = y_train,
validation_data = list(X_valid, y_valid),
epochs = 200,
batch_size = 256,
callbacks = list(callback_es),
verbose = 2
)
Plot training curves:
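keras3 training histories have a plot() method, so the curves can be drawn directly (a minimal sketch):

```r
# Loss and metric curves per epoch, for training and validation sets
plot(history)
```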
Points to discuss: where early stopping halted training, the gap between training and validation loss, and whether accuracy alone is informative given the class imbalance.
named_metrics <- model |> evaluate(X_test, y_test, verbose = 0)
named_metrics
## $accuracy
## [1] 0.8830022
## 
## $loss
## [1] 0.2930982
## 
## $precision
## [1] 0.6190476
## 
## $recall
## [1] 0.4126984
For a clearer view:
round(unlist(named_metrics), 3)
##  accuracy      loss precision    recall 
##     0.883     0.293     0.619     0.413
We can also inspect the confusion matrix at threshold 0.5:
pred_prob <- model |> predict(X_test)
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)
table(
truth = factor(y_test, levels = c(0, 1), labels = c("no", "yes")),
pred = factor(pred_class, levels = c(0, 1), labels = c("no", "yes"))
)
##      pred
## truth  no yes
##   no  374  16
##   yes  37  26
Consider reporting accuracy, precision, recall, and F1; in marketing applications, recall for the positive class (“yes”) is often a key metric.
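These quantities can be computed directly from the confusion matrix above (the counts are copied from the output):

```r
# Cells of the confusion matrix at threshold 0.5 ("yes" is the positive class)
TP <- 26; FP <- 16; FN <- 37; TN <- 374

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
        precision = precision, recall = recall, f1 = f1), 3)
# accuracy 0.883, precision 0.619, recall 0.413, f1 ~0.495
```

Note that precision and recall reproduce the evaluate() output above, while the F1 score summarizes their trade-off in a single number.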
You can quickly study architectural variations:
Example: a shallower model without strong regularization:
model_shallow <- keras_model_sequential(input_shape = input_dim) |>
  layer_dense(units = 32, activation = "relu") |>
  layer_dense(units = 1, activation = "sigmoid")
model_shallow |>
compile(
optimizer = optimizer_adam(learning_rate = 1e-3),
loss = "binary_crossentropy",
metrics = "accuracy"
)
history_shallow <- model_shallow |>
fit(
X_train, y_train,
validation_data = list(X_valid, y_valid),
epochs = 200,
batch_size = 256,
callbacks = list(callback_es),
verbose = 0
)
metrics_shallow <- model_shallow |>
evaluate(X_test, y_test, verbose = 0)
c(
DNN_regularized = round(named_metrics["accuracy"], 3),
Shallow = round(as.numeric(metrics_shallow["accuracy"]), 3)
)
## DNN_regularized.accuracy                  Shallow 
##                    0.883                    0.574
Discussion: the regularized deep model clearly outperforms the shallow, unregularized one, whose test accuracy (0.574) even falls below the majority-class baseline of about 0.86; always compare against simple baselines before trusting a model.
Some key best practices when deploying DNNs for tabular business data: standardize numeric features using training-set statistics only; keep categorical encodings consistent across splits; monitor validation metrics with early stopping; and report precision and recall, not just accuracy, when classes are imbalanced.
In this lesson you learned how to: load and preprocess mixed-type tabular data, build and regularize a feedforward network with keras3, train it with early stopping, and evaluate it with metrics suited to imbalanced binary classification.
A work by Gianluca Sottile
gianluca.sottile@unipa.it