DBSCAN (Density-Based Clustering)

DBSCAN groups points based on density rather than distance to centroids. Key concepts:

  • \(\varepsilon\)-neighborhood: \(N_\varepsilon(x) = \{y: \|x-y\| \le \varepsilon\}\).
  • Core point: \(|N_\varepsilon(x)| \ge \text{minPts}\).
  • Border point: not core but density-reachable from a core point.
  • Noise: points not density-reachable from any core point.

DBSCAN can detect non-spherical clusters and identifies outliers/noise explicitly.

In this lesson

  • We preprocess business data as in previous lesson.
  • We choose eps using kNN distance plot.
  • We run DBSCAN and interpret clusters vs noise.
  • We compare DBSCAN with K-means on the non-noise subset.

Step 1: Import + preprocessing

library(readr)
library(dplyr)

local_path <- "raw_data/wholesale_customers.csv"

if (!file.exists(local_path)) {
  dir.create("raw_data", showWarnings = FALSE)
  download.file(url_path, local_path, mode = "wb")
}

df_raw <- read_csv(local_path, show_col_types = FALSE)

spend_vars <- c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")

X <- df_raw |>
  mutate(across(all_of(spend_vars), ~ log1p(.x))) |>
  select(all_of(spend_vars)) |>
  mutate(across(everything(), scale)) |>
  as.data.frame()

Step 2: Parameter selection (minPts and eps)

A common heuristic sets \(\text{minPts} \approx 2p\) where \(p\) is the number of features.

p <- ncol(X)
minPts <- max(5, 2 * p)
minPts
## [1] 12

To choose eps, inspect the kNN distance plot: pick eps near the “knee”.

library(dbscan)

kNNdistplot(X, k = minPts)
abline(h = 1, col = "red", lty = 2)

Replace the horizontal line with your candidate value for eps after inspecting the knee.

Step 3: Fit DBSCAN

eps <- 1.0  # adjust after inspecting the kNN plot

db <- dbscan(X, eps = eps, minPts = minPts)
db
## DBSCAN clustering for 440 objects.
## Parameters: eps = 1, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 182 noise points.
## 
##   0   1 
## 182 258 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints
table(db$cluster)  # 0 = noise
## 
##   0   1 
## 182 258

Interpretation: - If most points are noise, eps may be too small or minPts too large. - If almost no noise and very few clusters, eps may be too large (merging structures).

Step 4: Visualization (PCA projection)

library(ggplot2)

pc <- prcomp(X)
plot_df <- data.frame(
  PC1 = pc$x[, 1],
  PC2 = pc$x[, 2],
  cluster = factor(db$cluster)
)

ggplot(plot_df, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.8, size = 2) +
  theme_minimal() +
  labs(
    title = "DBSCAN clusters (PCA projection)",
    subtitle = "Cluster 0 denotes noise",
    x = "PC1", y = "PC2"
  )

Step 5: Internal validation (silhouette on non-noise points)

Silhouette is not directly defined for noise. A practical approach is to compute silhouette on the subset with cluster label \(\ge 1\).

library(cluster)

idx <- which(db$cluster > 0)
X_nonoise <- X[idx, , drop = FALSE]
cl_nonoise <- db$cluster[idx]

if (length(unique(cl_nonoise)) >= 2) {
  d <- dist(X_nonoise)
  sil <- silhouette(cl_nonoise, d)
  summary(sil)
  plot(sil, border = NA, main = "Silhouette (non-noise subset)")
}

Step 6: Comparison with K-means (on non-noise subset)

library(mclust)

k_db <- length(unique(cl_nonoise))
set.seed(123)
km <- kmeans(X_nonoise, centers = k_db, nstart = 50)

ari <- adjustedRandIndex(cl_nonoise, km$cluster)
ari
## [1] 1

Interpretation:

  • Agreement is not expected to be perfect, because DBSCAN optimizes density connectivity, not SSE.
  • Differences often highlight non-spherical structure or outliers.

Step 7: Practical reporting

For a business segmentation report:

  • Report number of clusters found (excluding noise) and proportion of noise.
  • Provide cluster profiles (feature means in original units, not only z-scores).
  • Explain that DBSCAN can isolate atypical customers as noise/outliers, which may be a feature rather than a drawback.

Summary

  • DBSCAN is density-based and can recover non-spherical clusters.
  • minPts and eps are critical; the kNN distance plot is a practical tool for selecting eps.
  • Noise labeling provides a principled way to treat outliers.
  • Compare with K-means on the non-noise subset to understand structural differences.
 

A work by Gianluca Sottile

gianluca.sottile@unipa.it