Deep Neural Networks (DNNs) for Business Data

Deep neural networks approximate complex functions of the form
\(f_\theta : \mathbb{R}^p \to [0,1]\) (for binary classification) by composing affine transformations and non-linear activation functions.

A feedforward network with \(L\) layers has the recursive form \[ h^{(0)} = x, \quad h^{(\ell)} = \sigma\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big),\ \ell = 1,\dots,L-1,\quad \hat{y} = \text{sigmoid}\big(w^{(L)} h^{(L-1)} + b^{(L)}\big) \] where:

  • \(W^{(\ell)}\), \(b^{(\ell)}\) are layer weights and biases.
  • \(\sigma(\cdot)\) is a non-linear activation (e.g. ReLU).
  • The final output \(\hat{y}\) is interpreted as \(P(Y=1 \mid X=x)\).

Training minimizes a loss function plus an optional regularization term, e.g.
binary cross-entropy with \(L_2\) penalty: \[ \mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^n \big[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \big] + \lambda \sum_{\ell} \| W^{(\ell)} \|_2^2 \] optimized by variants of stochastic gradient descent (Adam, RMSprop, etc.).
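For concreteness, a single mini-batch gradient step updates the parameters as \[ \theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta), \] where \(\eta\) is the learning rate and \(\mathcal{L}_{\mathcal{B}}\) is the loss evaluated on a mini-batch \(\mathcal{B}\); Adam and RMSprop replace the raw gradient by an adaptively rescaled running average of past gradients.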

In this lesson we:

  • Build a reasonably deep network for bank marketing (term-deposit subscription).
  • Discuss architectural choices (layers, units, activations).
  • Add dropout and weight regularization.
  • Use validation sets, early stopping, and learning curves.
  • Evaluate predictive performance and interpret feature effects at a high level.

We work on a tabular, business-style dataset instead of standard image benchmarks, to align with typical applied use cases.

Step 1: Data import — Bank marketing dataset

We use a subset of the UCI Bank Marketing data (term-deposit subscription). Suppose you have saved a preprocessed version as raw_data/bank_marketing.csv (you can adapt the path/URL as needed).

path <- "raw_data/bank_marketing.csv"

bank <- read_csv2(path, show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, 40, 56, 37, 25, 31, 38, 42, 44, 4…
## $ job       <chr> "unemployed", "services", "management", "management", "blue-collar", "management", "self-empl…
## $ marital   <chr> "married", "married", "single", "married", "married", "single", "married", "married", "marrie…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary", "tertiary", "tertiary", "seconda…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ balance   <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 264, 1109, 502, 360, 194, 4073, 231…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no",…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "yes",…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "cellular", "cellular", "cellular",…
## $ day       <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29, 27, 20, 23, 7, 18, 19, 12, 7, 30…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may", "apr", "may", "apr", "aug", "a…
## $ duration  <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 328, 261, 89, 189, 239, 114, 250, 1…
## $ campaign  <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, 2, 3, 2, 2, 3, 2, 1, 1, 2, 2, 2, …
## $ pdays     <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1, 241, -1, -1, 152, -1, 152, -1, -…
## $ previous  <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "failure", "other", "unknown", "unknow…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "yes", "no", "n…

Assume that:

  • y is the target (factor with levels “no”, “yes” or 0/1).
  • Other columns are numeric or categorical predictors (age, job, balance, contact metrics, etc.).

We will:

  • Convert the target to a binary factor with "no" as reference.
  • Keep a subset of useful predictors for clarity.

bank <- bank |>
  mutate(
    y = factor(y, levels = c("no", "yes"))
  )

# Example: keep some core predictors (adjust to your actual columns)
bank <- bank |>
  select(
    y, age, balance, duration, campaign, pdays, previous,
    job, marital, education, default, housing, loan, contact, poutcome
  )

summary(bank$y)
##   no  yes 
## 4000  521

Step 2: Train/validation/test split

We perform an 80/10/10 split. For simplicity we use a plain random shuffle rather than a stratified split, so the class proportions in the three sets are only approximately preserved.

set.seed(123)

# Create indices
n <- nrow(bank)
idx <- sample.int(n)
bank_shuffled <- bank[idx, ]

# Simple 80/10/10 split
n_train <- floor(0.8 * n)
n_valid <- floor(0.1 * n)

train <- bank_shuffled[1:n_train, ]
valid <- bank_shuffled[(n_train + 1):(n_train + n_valid), ]
test  <- bank_shuffled[(n_train + n_valid + 1):n, ]

prop.table(table(train$y))
## 
##        no       yes 
## 0.8855088 0.1144912
prop.table(table(valid$y))
## 
##         no        yes 
## 0.90265487 0.09734513
prop.table(table(test$y))
## 
##        no       yes 
## 0.8609272 0.1390728
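The positive rates above drift noticeably across splits (11.4%, 9.7%, 13.9%). If you want each split to match the overall class proportions exactly, a stratified split samples within each class. The helper below, `stratified_split()`, is a hypothetical sketch in base R, not part of the pipeline used above:

```r
# Hypothetical helper (not used above): assign rows to train/valid/test by
# sampling *within* each class, so every split keeps the same class proportions.
stratified_split <- function(y, props = c(train = 0.8, valid = 0.1, test = 0.1)) {
  out <- character(length(y))
  for (cls in unique(y)) {
    idx    <- sample(which(y == cls))             # shuffle rows of this class
    cuts   <- floor(cumsum(props) * length(idx))  # cumulative split boundaries
    counts <- diff(c(0, cuts[1:2], length(idx)))  # sizes of the three splits
    out[idx] <- rep(names(props), times = counts)
  }
  out
}

# Usage sketch:
# split_of <- stratified_split(bank$y)
# train <- bank[split_of == "train", ]
```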

Step 3: Feature preprocessing for keras3

For tabular data with mixed types (numeric + categorical) we need to:

  • Normalize numeric features.
  • One-hot encode categorical features.

We build a simple preprocessing pipeline using base R; in more advanced setups you could use keras preprocessing layers or feature_spec utilities.

num_vars <- train |>
  select(where(is.numeric)) |>
  names()

cat_vars <- train |>
  select(where(is.character) | where(is.factor)) |>
  select(-y) |>
  names()

num_vars
## [1] "age"      "balance"  "duration" "campaign" "pdays"    "previous"
cat_vars
## [1] "job"       "marital"   "education" "default"   "housing"   "loan"      "contact"   "poutcome"

Standardize numeric features using training means and sds:

scale_numeric <- function(df, num_vars, center, scale) {
  df |>
    mutate(across(all_of(num_vars), ~ (.x - center[cur_column()]) / scale[cur_column()]))
}

num_means <- sapply(train[, num_vars, drop = FALSE], mean, na.rm = TRUE)
num_sds   <- sapply(train[, num_vars, drop = FALSE], sd,   na.rm = TRUE)

train_scaled <- scale_numeric(train, num_vars, num_means, num_sds)
valid_scaled <- scale_numeric(valid, num_vars, num_means, num_sds)
test_scaled  <- scale_numeric(test,  num_vars, num_means, num_sds)
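An optional sanity check: after scaling, the training columns should have mean ≈ 0 and sd ≈ 1 by construction, while the validation/test columns will deviate slightly because they reuse the training statistics:

```r
# Training features: mean ~ 0, sd ~ 1 after standardization
round(sapply(train_scaled[num_vars], mean), 3)
round(sapply(train_scaled[num_vars], sd), 3)
```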

One-hot encode categorical predictors using model.matrix():

make_design_matrix <- function(df) {
  mm <- model.matrix(
    ~ . - 1, 
    data = df |>
      select(-y) |>
      mutate(across(all_of(cat_vars), ~ factor(.x)))
  )
  as.matrix(mm)
}

X_train <- make_design_matrix(train_scaled)
X_valid <- make_design_matrix(valid_scaled)
X_test  <- make_design_matrix(test_scaled)
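Because model.matrix() is applied to each split separately, a factor level absent from one split would silently change that split's columns. A cheap guard is to assert that the three matrices line up:

```r
# Fail fast if the one-hot encodings are misaligned across splits
stopifnot(
  identical(colnames(X_train), colnames(X_valid)),
  identical(colnames(X_train), colnames(X_test))
)
```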

# Response as 0/1
y_train <- as.numeric(train_scaled$y == "yes")
y_valid <- as.numeric(valid_scaled$y == "yes")
y_test  <- as.numeric(test_scaled$y == "yes")

dim(X_train)
## [1] 3616   31
length(y_train)
## [1] 3616

Step 4: Deep network architecture — design choices

We now design a fully connected DNN for binary classification.

Key design components:

  • Depth and width:
    • Start with 2–3 hidden layers (depth).
    • The number of units per layer trades off capacity against overfitting; a common heuristic is 2–4 times the input dimension or a decreasing pattern (e.g., 128 → 64 → 32).
  • Activation functions:
    • Hidden layers: ReLU (or variants) to avoid vanishing gradients.
    • Output layer: sigmoid for binary classification.
  • Loss and optimization:
    • Binary cross-entropy loss.
    • Adam optimizer with a moderate learning rate (e.g., 1e-3).
  • Regularization:
    • L2 weight decay on dense layers.
    • Dropout layers between dense layers (e.g., dropout rate 0.3–0.5).
  • Validation and early stopping:
    • Monitor validation loss.
    • Use EarlyStopping callback with patience to stop training when performance stops improving.

Mathematically, each hidden layer is \[ h^{(\ell)} = \max(0, W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}), \] and dropout randomly sets a fraction \(p\) of units to zero during training, approximating an ensemble of thinned networks and acting as a regularizer.
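In the standard "inverted dropout" implementation (the one Keras uses), surviving units are rescaled at training time, \[ \tilde{h}^{(\ell)} = \frac{m \odot h^{(\ell)}}{1-p}, \qquad m_j \sim \text{Bernoulli}(1-p), \] so that \(\mathbb{E}[\tilde{h}^{(\ell)}] = h^{(\ell)}\) and no rescaling is needed at prediction time.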

Step 5: Implement the model in keras3

input_dim <- ncol(X_train)
input_dim
## [1] 31

Define the model:

library(keras3)

build_dnn_model <- function(input_dim, l2_lambda = 1e-4, dropout_rate = 0.3) {
  
  input <- layer_input(shape = input_dim, name = "features")
  
  x <- input |>
    layer_dense(units = 128, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
    layer_dropout(rate = dropout_rate) |>
    layer_dense(units = 64, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda)) |>
    layer_dropout(rate = dropout_rate) |>
    layer_dense(units = 32, activation = "relu",
                kernel_regularizer = regularizer_l2(l = l2_lambda))
  
  output <- x |>
    layer_dense(units = 1, activation = "sigmoid", name = "output")
  
  model <- keras_model(inputs = input, outputs = output)
  
  model |>
    compile(
      optimizer = optimizer_adam(learning_rate = 1e-3),
      loss = "binary_crossentropy",
      metrics = list(
        "accuracy",
        metric_precision(name = "precision"),
        metric_recall(name = "recall")
      )
    )
  
  model
}

model <- build_dnn_model(input_dim)
model
## Model: "functional_11"
## ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
## ┃ Layer (type)                                    ┃ Output Shape                         ┃              Param # ┃
## ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
## │ features (InputLayer)                           │ (None, 31)                           │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_17 (Dense)                                │ (None, 128)                          │                4,096 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_8 (Dropout)                             │ (None, 128)                          │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_18 (Dense)                                │ (None, 64)                           │                8,256 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dropout_9 (Dropout)                             │ (None, 64)                           │                    0 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ dense_19 (Dense)                                │ (None, 32)                           │                2,080 │
## ├─────────────────────────────────────────────────┼──────────────────────────────────────┼──────────────────────┤
## │ output (Dense)                                  │ (None, 1)                            │                   33 │
## └─────────────────────────────────────────────────┴──────────────────────────────────────┴──────────────────────┘
##  Total params: 14,465 (56.50 KB)
##  Trainable params: 14,465 (56.50 KB)
##  Non-trainable params: 0 (0.00 B)

Step 6: Train with validation and early stopping

callback_es <- callback_early_stopping(
  monitor = "val_loss",
  patience = 10,
  restore_best_weights = TRUE
)

set.seed(123)            # controls R-side randomness only
# For reproducible keras3 training, also seed Python/TensorFlow:
set_random_seed(123)
history <- model |>
  fit(
    x = X_train, y = y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200,
    batch_size = 256,
    callbacks = list(callback_es),
    verbose = 2
  )

Plot training curves:

plot(history) +
  theme_minimal() +
  ggtitle("Training and validation metrics — DNN")

Points to discuss:

  • Diverging training/validation curves indicate overfitting.
  • Early stopping truncates training once validation loss stops decreasing, selecting a model near the low point of validation loss.

Step 7: Evaluation on test set

metrics_test <- model |>
  evaluate(X_test, y_test, verbose = 0)

metrics_test
## $accuracy
## [1] 0.8830022
## 
## $loss
## [1] 0.2930982
## 
## $precision
## [1] 0.6190476
## 
## $recall
## [1] 0.4126984

For a clearer view:

named_metrics <- setNames(as.numeric(metrics_test), names(metrics_test))
round(named_metrics, 3)
##  accuracy      loss precision    recall 
##     0.883     0.293     0.619     0.413

We can also inspect the confusion matrix at threshold 0.5:

pred_prob <- model |> predict(X_test)
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

table(
  truth = factor(y_test, levels = c(0, 1), labels = c("no", "yes")),
  pred  = factor(pred_class, levels = c(0, 1), labels = c("no", "yes"))
)
##      pred
## truth  no yes
##   no  374  16
##   yes  37  26

Consider reporting accuracy, precision, recall, and F1; in marketing applications, recall for the positive class (“yes”) is often a key metric.
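As a quick worked example, the precision and recall reported earlier can be recovered directly from the confusion matrix, and F1 follows from them:

```r
# From the confusion matrix at threshold 0.5: TP = 26, FP = 16, FN = 37, TN = 374
tp <- 26; fp <- 16; fn <- 37; tn <- 374
precision <- tp / (tp + fp)   # 26/42 ≈ 0.619
recall    <- tp / (tp + fn)   # 26/63 ≈ 0.413
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
## precision    recall        f1 
##     0.619     0.413     0.495
```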

Step 8: Architecture sensitivity — width, depth, and regularization

You can quickly study architectural variations:

  • More or fewer layers.
  • Larger or smaller units.
  • Stronger or weaker L2 and dropout.

Example: a shallower model without strong regularization:

model_shallow <- keras_model_sequential(input_shape = input_dim) |>
  layer_dense(units = 32, activation = "relu") |>
  layer_dense(units = 1, activation = "sigmoid")

model_shallow |>
  compile(
    optimizer = optimizer_adam(learning_rate = 1e-3),
    loss = "binary_crossentropy",
    metrics = "accuracy"
  )

history_shallow <- model_shallow |>
  fit(
    X_train, y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200,
    batch_size = 256,
    callbacks = list(callback_es),
    verbose = 0
  )

metrics_shallow <- model_shallow |>
  evaluate(X_test, y_test, verbose = 0)

c(
  DNN_regularized = round(as.numeric(named_metrics["accuracy"]), 3),
  Shallow         = round(as.numeric(metrics_shallow["accuracy"]), 3)
)
## DNN_regularized         Shallow 
##           0.883           0.574

Discussion:

  • If the shallow model underperforms, the deeper network is exploiting additional representation power.
  • If the deeper model overfits while the shallow model generalizes better, you may need more regularization or simpler architectures.

Step 9: Practical considerations

Some key best practices when deploying DNNs for tabular business data:

  • Normalization: Always normalize numeric features; DNNs are sensitive to scale.
  • Class imbalance: Use class weights or focal losses when the positive class is rare.
  • Regularization:
    • L2 weight decay to penalize large weights.
    • Dropout between dense layers to mitigate co-adaptation.
  • Validation strategy:
    • Keep a validation set that mirrors the operational distribution.
    • Use early stopping to prevent overfitting.
  • Reproducibility:
    • Fix seeds (R + Python) when possible.
    • Log architectures, hyperparameters, and performance.
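For instance, class weights can be passed straight to fit(). The sketch below (assuming the objects defined earlier in this lesson) up-weights the rare "yes" class in inverse proportion to its frequency:

```r
# Up-weight the positive class; "0"/"1" are the class indices keras3 expects
w_pos <- sum(y_train == 0) / sum(y_train == 1)   # ≈ 7.7 given ~11.5% positives

history_weighted <- model |>
  fit(
    X_train, y_train,
    validation_data = list(X_valid, y_valid),
    epochs = 200, batch_size = 256,
    class_weight = list("0" = 1, "1" = w_pos),
    callbacks = list(callback_es),
    verbose = 0
  )
```

With weighting, the loss penalizes missed positives more heavily, which typically raises recall at some cost in precision.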

Summary

In this lesson you learned how to:

  • Formulate a deep neural network as a composition of affine maps and non-linear activations, trained by minimizing cross-entropy with regularization.
  • Design a DNN architecture for tabular business data, selecting depth, width, activations, and regularization (dropout, L2).
  • Implement the model in R with keras3 and TensorFlow, including data preprocessing, training with validation and early stopping, and evaluation on a held-out test set.
  • Compare alternative architectures and interpret their impact on generalization.
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it