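DBSCAN groups points based on density rather than distance to centroids. Key concepts: eps (the radius of the neighborhood around each point), minPts (the minimum number of points required in that neighborhood for a point to count as a core point), and the resulting split into core, border and noise points.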
DBSCAN can detect non-spherical clusters and identifies outliers/noise explicitly.
In this lesson: choosing eps using the kNN distance plot.
library(readr)
library(dplyr)
local_path <- "raw_data/wholesale_customers.csv"
# url_path (the dataset's download URL) must already be defined; its definition is not shown here
if (!file.exists(local_path)) {
  dir.create("raw_data", showWarnings = FALSE)
  download.file(url_path, local_path, mode = "wb")
}
df_raw <- read_csv(local_path, show_col_types = FALSE)
spend_vars <- c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")
# Log-transform the skewed spending variables, then standardize each column.
# as.numeric() drops the one-column matrix that scale() returns.
X <- df_raw |>
  mutate(across(all_of(spend_vars), ~ log1p(.x))) |>
  select(all_of(spend_vars)) |>
  mutate(across(everything(), ~ as.numeric(scale(.x)))) |>
  as.data.frame()

A common heuristic sets \(\text{minPts} \approx 2p\) where \(p\) is the number of features; here \(p = 6\).
## [1] 12
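The chunk that printed the 12 above is not shown. A minimal sketch of the heuristic, assuming minPts is simply twice the number of spending variables (the variable names p and minPts are assumptions):

```r
# Assumed reconstruction of the heuristic minPts = 2 * p (not the lesson's original chunk)
spend_vars <- c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")
p <- length(spend_vars)   # number of features, 6 here
minPts <- 2 * p
minPts
```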
To choose eps, inspect the kNN distance plot and pick eps near the “knee”, the point where the sorted distances start to rise sharply.

Replace the horizontal line with your candidate value for eps after inspecting the knee.
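The plot itself is not reproduced here. As a rough sketch of what it shows, the sorted k-nearest-neighbour distances can be computed in base R. The toy matrix X_toy, the value of k and eps_candidate below are illustrative assumptions, not the lesson's data or values. The dbscan package also provides kNNdistplot() for this purpose.

```r
# Toy data standing in for the scaled feature matrix (assumption, not the lesson's data)
set.seed(42)
X_toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),  # dense group of 50 points
  matrix(runif(20, min = -3, max = 3), ncol = 2)     # 10 scattered points
)

k <- 4                                  # neighbour rank to inspect (illustrative)
d <- as.matrix(dist(X_toy))             # full pairwise Euclidean distances
# For each point, sort its distances; position 1 is the point itself (distance 0),
# so position k + 1 is the distance to its k-th nearest neighbour
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

eps_candidate <- 0.5                    # hypothetical value; replace after inspecting the knee
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance", ylab = paste0(k, "-NN distance"))
abline(h = eps_candidate, lty = 2)      # horizontal line at the candidate eps
```

Computing the full distance matrix is fine for a few hundred rows such as the 440 customers here; for much larger data a dedicated nearest-neighbour routine would be the better choice.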
## DBSCAN clustering for 440 objects.
## Parameters: eps = 1, minPts = 12
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 182 noise points.
##
## 0 1
## 182 258
##
## Available fields: cluster, eps, minPts, metric, borderPoints
##
## 0 1
## 182 258
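The call that produced this output is not shown. Judging from the printed parameters, it was presumably something like dbscan(X, eps = 1, minPts = 12) from the dbscan package. The sketch below makes that assumption and uses a toy matrix with its own illustrative eps and minPts so that it runs on its own:

```r
library(dbscan)

# Toy data (assumption): two tight groups plus a few scattered points
set.seed(1)
X_toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.2), ncol = 2),
  matrix(runif(10, min = -2, max = 5), ncol = 2)
)

# Illustrative parameters for the toy data, not the lesson's eps = 1, minPts = 12
db_toy <- dbscan(X_toy, eps = 0.5, minPts = 5)
db_toy                  # prints a summary in the same format as the output above
table(db_toy$cluster)   # cluster sizes; label 0 is noise
```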
Interpretation:
- If most points are noise, eps may be too small or minPts too large.
- If there is almost no noise and very few clusters, eps may be too large (merging structures).
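To make the first check concrete, the share of noise points can be read off the cluster table. A small sketch using the counts printed above (182 noise points, 258 points in cluster 1); the variable names are illustrative:

```r
# Counts taken from the DBSCAN output above; label "0" is noise
cluster_counts <- c("0" = 182, "1" = 258)
noise_share <- cluster_counts[["0"]] / sum(cluster_counts)
round(noise_share, 3)   # 0.414
```

With a fitted model, the same quantity is mean(db$cluster == 0). Here roughly 41% of the 440 customers are labelled noise, which is worth weighing against the guidance above before settling on eps = 1.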
library(ggplot2)
# Project the scaled features onto the first two principal components for plotting
pc <- prcomp(X)
plot_df <- data.frame(
  PC1 = pc$x[, 1],
  PC2 = pc$x[, 2],
  cluster = factor(db$cluster)  # db is the fitted DBSCAN object; label 0 is noise
)
ggplot(plot_df, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.8, size = 2) +
  theme_minimal() +
  labs(
    title = "DBSCAN clusters (PCA projection)",
    subtitle = "Cluster 0 denotes noise",
    x = "PC1", y = "PC2"
  )
Silhouette is not directly defined for noise. A practical approach is to compute silhouette on the subset with cluster label \(\ge 1\).
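A minimal sketch of that subsetting step, assuming the cluster package for silhouette() and using toy data with hand-assigned labels (0 for noise) in place of the lesson's X and db$cluster. The names X_nonoise and cl_nonoise mirror those used in the comparison code, but how the lesson actually builds them is not shown:

```r
library(cluster)

# Toy data (assumption): two well-separated groups plus 5 points treated as noise
set.seed(7)
X_toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  matrix(runif(10, min = -2, max = 7), ncol = 2)
)
labels_toy <- c(rep(1L, 20), rep(2L, 20), rep(0L, 5))  # 0 = noise, as in DBSCAN output

# Keep only points assigned to a cluster (label >= 1)
keep <- labels_toy >= 1
X_nonoise <- X_toy[keep, , drop = FALSE]
cl_nonoise <- labels_toy[keep]

# silhouette() needs at least two clusters; with a single cluster it returns NA
sil <- silhouette(cl_nonoise, dist(X_nonoise))
mean(sil[, "sil_width"])   # average silhouette width over non-noise points
```

In the lesson's run DBSCAN returned a single cluster, so a silhouette computed on that result would be NA; the measure only becomes informative once at least two clusters are found.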
library(mclust)
k_db <- length(unique(cl_nonoise))
set.seed(123)
km <- kmeans(X_nonoise, centers = k_db, nstart = 50)
ari <- adjustedRandIndex(cl_nonoise, km$cluster)
ari
## [1] 1
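Interpretation: DBSCAN found a single cluster here, so k_db is 1 and k-means with one center places every non-noise point in the same group. The ARI of 1 is therefore trivial and is not evidence that the two methods agree.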
For a business segmentation report:
minPts and eps are critical; the kNN distance plot is a practical tool for selecting eps.
A work by Gianluca Sottile
gianluca.sottile@unipa.it