Deep Neural Networks (DNNs) for Business Data

Deep neural networks approximate complex functions of the form
\(f_\theta : \mathbb{R}^p \to [0,1]\) (for binary classification) by composing affine transformations and non-linear activation functions.

A feedforward network with \(L\) layers has the recursive form \[ h^{(0)} = x, \quad h^{(\ell)} = \sigma\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big),\ \ell = 1,\dots,L-1,\quad \hat{y} = \text{sigmoid}\big(w^{(L)} h^{(L-1)} + b^{(L)}\big) \] where:

  • \(W^{(\ell)}\), \(b^{(\ell)}\) are layer weights and biases.
  • \(\sigma(\cdot)\) is a non-linear activation (e.g. ReLU).
  • The final output \(\hat{y}\) is interpreted as \(P(Y=1 \mid X=x)\).

Training minimizes a loss function plus an optional regularization term, e.g.
binary cross-entropy with \(L_2\) penalty: \[ \mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n \big[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \big] + \lambda \sum_{\ell} \| W^{(\ell)} \|_2^2 \] optimized by variants of stochastic gradient descent (Adam, RMSprop, etc.).
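For concreteness, a single mini-batch gradient step updates the parameters as \[ \theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta), \] where \(\eta\) is the learning rate and \(\mathcal{L}_{\mathcal{B}}\) is the loss evaluated on a mini-batch \(\mathcal{B}\); Adam and RMSprop replace the raw gradient by an adaptively rescaled running average of past gradients.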

In this lesson we:

  • Build a reasonably deep network for bank marketing (term-deposit subscription).
  • Discuss architectural choices (layers, units, activations).
  • Add dropout and weight regularization.
  • Use validation sets, early stopping, and learning curves.
  • Evaluate predictive performance and interpret feature effects at a high level.

We work on a tabular, business-style dataset instead of standard image benchmarks, to align with typical applied use cases.

Step 1: Data import — Bank marketing dataset

We use a subset of the UCI Bank Marketing data (term-deposit subscription). Suppose you have saved a preprocessed version as raw_data/bank_marketing.csv (you can adapt the path/URL as needed).

path <- "raw_data/bank_marketing.csv"

bank <- read_csv2(path, show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, 40, 56, 37, 25, 31, 38, 42, 44, 4…
## $ job       <chr> "unemployed", "services", "management", "management", "blue-collar", "management", "self-empl…
## $ marital   <chr> "married", "married", "single", "married", "married", "single", "married", "married", "marrie…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary", "tertiary", "tertiary", "seconda…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ balance   <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 264, 1109, 502, 360, 194, 4073, 231…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no",…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "yes",…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "cellular", "cellular", "cellular",…
## $ day       <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29, 27, 20, 23, 7, 18, 19, 12, 7, 30…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may", "apr", "may", "apr", "aug", "a…
## $ duration  <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 328, 261, 89, 189, 239, 114, 250, 1…
## $ campaign  <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, 2, 3, 2, 2, 3, 2, 1, 1, 2, 2, 2, …
## $ pdays     <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1, 241, -1, -1, 152, -1, 152, -1, -…
## $ previous  <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "failure", "other", "unknown", "unknow…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "yes", "no", "n…

Assume that:

  • y is the target (factor with levels “no”, “yes” or 0/1).
  • Other columns are numeric or categorical predictors (age, job, balance, contact metrics, etc.).

We will:

  • Convert the target to a binary factor with "no" as reference.
  • Keep a subset of useful predictors for clarity.

bank <- bank |>
  mutate(
    y = factor(y, levels = c("no", "yes"))
  )

# Example: keep some core predictors (adjust to your actual columns)
bank <- bank |>
  select(
    y, age, balance, duration, campaign, pdays, previous,
    job, marital, education, default, housing, loan, contact, poutcome
  )

summary(bank$y)
##   no  yes 
## 4000  521

Step 2: Train/validation/test split

We perform an 80/10/10 split. For simplicity we use a plain random shuffle rather than a stratified split, so the class proportions in the three sets are only approximately preserved.

set.seed(123)

# Create indices
n <- nrow(bank)
idx <- sample.int(n)
bank_shuffled <- bank[idx, ]

# Simple 80/10/10 split
n_train <- floor(0.8 * n)
n_valid <- floor(0.1 * n)

train <- bank_shuffled[1:n_train, ]
valid <- bank_shuffled[(n_train + 1):(n_train + n_valid), ]
test  <- bank_shuffled[(n_train + n_valid + 1):n, ]

prop.table(table(train$y))
## 
##        no       yes 
## 0.8855088 0.1144912
prop.table(table(valid$y))
## 
##         no        yes 
## 0.90265487 0.09734513
prop.table(table(test$y))
## 
##        no       yes 
## 0.8609272 0.1390728
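The positive rates above drift noticeably across splits (11.4%, 9.7%, 13.9%). If you want each split to match the overall class proportions exactly, a stratified split samples within each class. The helper below, `stratified_split()`, is a hypothetical sketch in base R, not part of the pipeline used above:

```r
# Hypothetical helper (not used above): assign rows to train/valid/test by
# sampling *within* each class, so every split keeps the same class proportions.
stratified_split <- function(y, props = c(train = 0.8, valid = 0.1, test = 0.1)) {
  out <- character(length(y))
  for (cls in unique(y)) {
    idx    <- sample(which(y == cls))             # shuffle rows of this class
    cuts   <- floor(cumsum(props) * length(idx))  # cumulative split boundaries
    counts <- diff(c(0, cuts[1:2], length(idx)))  # sizes of the three splits
    out[idx] <- rep(names(props), times = counts)
  }
  out
}

# Usage sketch:
# split_of <- stratified_split(bank$y)
# train <- bank[split_of == "train", ]
```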

Step 3: Feature preprocessing for keras3

For tabular data with mixed types (numeric + categorical) we need to:

  • Normalize numeric features.
  • One-hot encode categorical features.

We build a simple preprocessing pipeline using base R; in more advanced setups you could use keras preprocessing layers or feature_spec utilities.

num_vars <- train |>
  select(where(is.numeric)) |>
  names()

cat_vars <- train |>
  select(where(is.character) | where(is.factor)) |>
  select(-y) |>
  names()

num_vars
## [1] "age"      "balance"  "duration" "campaign" "pdays"    "previous"
cat_vars
## [1] "job"       "marital"   "education" "default"   "housing"   "loan"      "contact"   "poutcome"

Standardize numeric features using training means and sds:

scale_numeric <- function(df, num_vars, center, scale) {
  df |>
    mutate(across(all_of(num_vars), ~ (.x - center[cur_column()]) / scale[cur_column()]))
}

num_means <- sapply(train[, num_vars, drop = FALSE], mean, na.rm = TRUE)
num_sds   <- sapply(train[, num_vars, drop = FALSE], sd,   na.rm = TRUE)

train_scaled <- scale_numeric(train, num_vars, num_means, num_sds)
valid_scaled <- scale_numeric(valid, num_vars, num_means, num_sds)
test_scaled  <- scale_numeric(test,  num_vars, num_means, num_sds)
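An optional sanity check: after scaling, the training columns should have mean ≈ 0 and sd ≈ 1 by construction, while the validation/test columns will deviate slightly because they reuse the training statistics:

```r
# Training features: mean ~ 0, sd ~ 1 after standardization
round(sapply(train_scaled[num_vars], mean), 3)
round(sapply(train_scaled[num_vars], sd), 3)
```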

One-hot encode categorical predictors using model.matrix():

make_design_matrix <- function(df) {
  mm <- model.matrix(
    ~ . - 1, 
    data = df |>
      select(-y) |>
      mutate(across(all_of(cat_vars), ~ factor(.x)))
  )
  as.matrix(mm)
}

X_train <- make_design_matrix(train_scaled)
X_valid <- make_design_matrix(valid_scaled)
X_test  <- make_design_matrix(test_scaled)
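Because model.matrix() is applied to each split separately, a factor level absent from one split would silently change that split's columns. A cheap guard is to assert that the three matrices line up:

```r
# Fail fast if the one-hot encodings are misaligned across splits
stopifnot(
  identical(colnames(X_train), colnames(X_valid)),
  identical(colnames(X_train), colnames(X_test))
)
```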

# Response as 0/1
y_train <- as.numeric(train_scaled$y == "yes")
y_valid <- as.numeric(valid_scaled$y == "yes")
y_test  <- as.numeric(test_scaled$y == "yes")

dim(X_train)
## [1] 3616   31
length(y_train)
## [1] 3616

Step 4: Deep network architecture — design choices

We now design a fully connected DNN for binary classification.

Key design components:

  • Depth and width:
    • Start with 2–3 hidden layers (depth).
    • The number of units per layer trades off capacity against overfitting; a common heuristic is 2–4 times the input dimension or a decreasing pattern (e.g., 128 → 64 → 32).
  • Activation functions:
    • Hidden layers: ReLU (or variants) to avoid vanishing gradients.
    • Output layer: sigmoid for binary classification.
  • Loss and optimization:
    • Binary cross-entropy loss.
    • Adam optimizer with a moderate learning rate (e.g., 1e-3).
  • Regularization:
    • L2 weight decay on dense layers.
    • Dropout layers between dense layers (e.g., dropout rate 0.3–0.5).
  • Validation and early stopping:
    • Monitor validation loss.
    • Use EarlyStopping callback with patience to stop training when performance stops improving.

Mathematically, each hidden layer is \[ h^{(\ell)} = \max(0, W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}), \] and dropout randomly sets a fraction \(p\) of units to zero during training, approximating an ensemble of thinned networks and acting as a regularizer.
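In the standard "inverted dropout" implementation (the one Keras uses), surviving units are rescaled at training time, \[ \tilde{h}^{(\ell)} = \frac{m \odot h^{(\ell)}}{1-p}, \qquad m_j \sim \text{Bernoulli}(1-p), \] so that \(\mathbb{E}[\tilde{h}^{(\ell)}] = h^{(\ell)}\) and no rescaling is needed at prediction time.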

Step 5: Implement the model in keras3

input_dim <- ncol(X_train)
input_dim
## [1] 31

Define the model:

library(keras3)

build_dnn_model <- function(input_dim, l2_lambda = 1e-4, dropout_rate = 0.3) {
  
  input <- layer_input(shape = input_dim, name = "features")
  
  x <- input |>
    layer_dense(units = 128, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
    layer_dropout(rate = dropout_rate) |>
    layer_dense(units = 64, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
    layer_dropout(rate = dropout_rate) |>
    layer_dense(units = 32, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda))
  
  output <- x |>
    layer_dense(units = 1, activation = "sigmoid", name = "output")
  
  model <- keras_model(inputs = input, outputs = output)
  
  model |>
    compile(
      optimizer = optimizer_adam(learning_rate = 1e-3),
      loss = "binary_crossentropy",
      metrics = list(
        "accuracy",
        metric_precision(name = "precision"),
        metric_recall(name = "recall")
      )
    )
  
  model
}

model <- build_dnn_model(input_dim)
model
## Model: "functional_11"
## ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
## ┃ Layer (type)                                    ┃ Output Shape                         ┃              Param # ┃
## ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
## │ features (InputLayer)                           │ (None, 31)                           │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_17 (Dense)                                │ (None, 128)                          │                4,096 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_8 (Dropout)                             │ (None, 128)                          │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_18 (Dense)                                │ (None, 64)                           │                8,256 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_9 (Dropout)                             │ (None, 64)                           │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_19 (Dense)                                │ (None, 32)                           │                2,080 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ output (Dense)                                  │ (None, 1)                            │                   33 │
## └─────────────────────────────────────────────────┴──────────────────────────────────────┴──────────────────────┘
##  Total params: 14,465 (56.50 KB)
##  Trainable params: 14,465 (56.50 KB)
##  Non-trainable params: 0 (0.00 B)

Step 6: Train with validation and early stopping

callback_es <- callback_early_stopping(
  monitor = "val_loss",
  patience = 10,
  restore_best_weights = TRUE
)

set.seed(123)            # controls R-side randomness only
# For reproducible keras3 training, also seed Python/TensorFlow:
set_random_seed(123)
history <- model |>
  fit(
    x = X_train, y = y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200,
    batch_size = 256,
    callbacks = list(callback_es),
    verbose = 2
  )

Plot training curves:

plot(history) +
  theme_minimal() +
  ggtitle("Training and validation metrics — DNN")

Points to discuss:

  • Diverging training/validation curves indicate overfitting.
  • Early stopping truncates training once validation loss stops decreasing, selecting a model near the low point of validation loss.

Step 7: Evaluation on test set

metrics_test <- model |>
  evaluate(X_test, y_test, verbose = 0)

metrics_test
## $accuracy
## [1] 0.8830022
## 
## $loss
## [1] 0.2930982
## 
## $precision
## [1] 0.6190476
## 
## $recall
## [1] 0.4126984

For a clearer view:

named_metrics <- setNames(as.numeric(metrics_test), names(metrics_test))
round(named_metrics, 3)
##  accuracy      loss precision    recall 
##     0.883     0.293     0.619     0.413

We can also inspect the confusion matrix at threshold 0.5:

pred_prob <- model |> predict(X_test)
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

table(
  truth = factor(y_test, levels = c(0, 1), labels = c("no", "yes")),
  pred  = factor(pred_class, levels = c(0, 1), labels = c("no", "yes"))
)
##      pred
## truth  no yes
##   no  374  16
##   yes  37  26

Consider reporting accuracy, precision, recall, and F1; in marketing applications, recall for the positive class (“yes”) is often a key metric.
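As a quick worked example, the precision and recall reported earlier can be recovered directly from the confusion matrix, and F1 follows from them:

```r
# From the confusion matrix at threshold 0.5: TP = 26, FP = 16, FN = 37, TN = 374
tp <- 26; fp <- 16; fn <- 37; tn <- 374
precision <- tp / (tp + fp)   # 26/42 ≈ 0.619
recall    <- tp / (tp + fn)   # 26/63 ≈ 0.413
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
## precision    recall        f1 
##     0.619     0.413     0.495
```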

Step 8: Architecture sensitivity — width, depth, and regularization

You can quickly study architectural variations:

  • More or fewer layers.
  • Larger or smaller units.
  • Stronger or weaker L2 and dropout.

Example: a shallower model without strong regularization:

model_shallow <- keras_model_sequential(input_shape = input_dim) |>
  layer_dense(units = 32, activation = "relu") |>
  layer_dense(units = 1, activation = "sigmoid")

model_shallow |>
  compile(
    optimizer = optimizer_adam(learning_rate = 1e-3),
    loss = "binary_crossentropy",
    metrics = "accuracy"
  )

history_shallow <- model_shallow |>
  fit(
    X_train, y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200,
    batch_size = 256,
    callbacks = list(callback_es),
    verbose = 0
  )

metrics_shallow <- model_shallow |>
  evaluate(X_test, y_test, verbose = 0)

c(
  DNN_regularized = round(as.numeric(named_metrics["accuracy"]), 3),
  Shallow         = round(as.numeric(metrics_shallow["accuracy"]), 3)
)
## DNN_regularized         Shallow 
##           0.883           0.574

Discussion:

  • If the shallow model underperforms, the deeper network is exploiting additional representation power.
  • If the deeper model overfits while the shallow model generalizes better, you may need more regularization or simpler architectures.

Step 9: Practical considerations

Some key best practices when deploying DNNs for tabular business data:

  • Normalization: Always normalize numeric features; DNNs are sensitive to scale.
  • Class imbalance: Use class weights or focal losses when the positive class is rare.
  • Regularization:
    • L2 weight decay to penalize large weights.
    • Dropout between dense layers to mitigate co-adaptation.
  • Validation strategy:
    • Keep a validation set that mirrors the operational distribution.
    • Use early stopping to prevent overfitting.
  • Reproducibility:
    • Fix seeds (R + Python) when possible.
    • Log architectures, hyperparameters, and performance.
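For instance, class weights can be passed straight to fit(). The sketch below (assuming the objects defined earlier in this lesson) up-weights the rare "yes" class in inverse proportion to its frequency:

```r
# Up-weight the positive class; "0"/"1" are the class indices keras3 expects
w_pos <- sum(y_train == 0) / sum(y_train == 1)   # ≈ 7.7 given ~11.5% positives

history_weighted <- model |>
  fit(
    X_train, y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200, batch_size = 256,
    class_weight = list("0" = 1, "1" = w_pos),
    callbacks = list(callback_es),
    verbose = 0
  )
```

With weighting, the loss penalizes missed positives more heavily, which typically raises recall at some cost in precision.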

Summary

In this lesson you learned how to:

  • Formulate a deep neural network as a composition of affine maps and non-linear activations, trained by minimizing cross-entropy with regularization.
  • Design a DNN architecture for tabular business data, selecting depth, width, activations, and regularization (dropout, L2).
  • Implement the model in R with keras3 and TensorFlow, including data preprocessing, training with validation and early stopping, and evaluation on a held-out test set.
  • Compare alternative architectures and interpret their impact on generalization.
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it