Spectral Clustering

Spectral clustering is a graph-based method. It builds an affinity graph \(W\) from similarities (e.g., RBF kernel): \(W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))\).

Define the degree matrix \(D\) where \(D_{ii} = \sum_j W_{ij}\), and (one version of) the normalized Laplacian: \(L = I - D^{-1/2} W D^{-1/2}\).

The algorithm embeds data into the space spanned by a few eigenvectors of \(L\) (or a related matrix) and then runs a simple clustering method (often K-means) in that embedding.

This family is particularly useful when clusters are not well-separated in Euclidean space but are separable by connectivity on a similarity graph.

In this lesson:

Fit spectral clustering with an RBF kernel.
Tune the kernel scale \(\sigma\) pragmatically.
Compare with K-means/hierarchical results.

Step 1: Import + preprocessing

library(readr)
library(dplyr)

local_path <- "raw_data/wholesale_customers.csv"
df_raw <- read_csv(local_path, show_col_types = FALSE)

spend_vars <- c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")

X <- df_raw |>
  mutate(across(all_of(spend_vars), ~ log1p(.x))) |>
  select(all_of(spend_vars)) |>
  mutate(across(everything(), scale)) |>
  as.matrix()

Step 2: Choose k (reuse from prior analyses or evaluate a grid)

Spectral clustering typically requires specifying \(k\). Here we illustrate a simple approach: test a small grid and select based on silhouette in the embedded labels.

library(cluster)
library(kernlab)

avg_sil_from_labels <- function(X, labels) {
  if (length(unique(labels)) < 2) return(NA_real_)
  s <- silhouette(labels, dist(X))
  mean(s[, 3])
}

k_grid <- 2:10
sigma_grid <- c(0.1, 0.2, 0.5, 1)

results <- expand.grid(k = k_grid, sigma = sigma_grid)
results$avg_sil <- NA_real_

set.seed(123)
for (i in seq_len(nrow(results))) {
  k <- results$k[i]
  sig <- results$sigma[i]
  
  sc <- specc(X, centers = k, kernel = "rbfdot", kpar = list(sigma = sig))
  results$avg_sil[i] <- avg_sil_from_labels(X, as.integer(sc))
}

results <- results[order(-results$avg_sil), ]
head(results, 10)

	k	sigma	avg_sil
28	2	1.0	0.2897449
10	2	0.2	0.2892797
19	2	0.5	0.2888352
1	2	0.1	0.2868777
11	3	0.2	0.2498515
2	3	0.1	0.2071218
12	4	0.2	0.1920744
21	4	0.5	0.1790377
3	4	0.1	0.1777185
5	6	0.1	0.1729661

Select the best configuration (highest silhouette), and then refit:

best_k <- results$k[1]
best_sigma <- results$sigma[1]

set.seed(123)
sc_best <- specc(X, centers = best_k, kernel = "rbfdot", kpar = list(sigma = best_sigma))
cl_sc <- as.integer(sc_best)

table(cl_sc)

## cl_sc
##   1   2 
## 254 186

Step 3: Visualization (PCA projection)

library(ggplot2)

pc <- prcomp(X)
plot_df <- data.frame(
  PC1 = pc$x[, 1],
  PC2 = pc$x[, 2],
  cluster = factor(cl_sc)
)

ggplot(plot_df, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 2, alpha = 0.85) +
  theme_minimal() +
  labs(
    title = "Spectral clustering — PCA projection",
    subtitle = paste("k =", best_k, "| RBF sigma =", best_sigma),
    x = "PC1", y = "PC2"
  )

Step 4: Comparison with K-means (same k)

library(mclust)

set.seed(123)
km <- kmeans(X, centers = best_k, nstart = 50)

ari <- adjustedRandIndex(cl_sc, km$cluster)
ari

## [1] 0.893599

Interpretation:

Spectral clustering is preferred when K-means struggles due to non-convex or graph-structured clusters.
If ARI is high, it suggests the structure is already well captured by centroid-based clustering.

Step 5: Cluster profiles

df_prof <- as.data.frame(X)
df_prof$cluster <- factor(cl_sc)

profiles <- df_prof |>
  group_by(cluster) |>
  summarize(across(where(is.numeric), mean), .groups = "drop")

profiles

cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	0.2152193	-0.5720568	-0.6377129	0.2678828	-0.6509132	-0.1329263
2	-0.2939016	0.7811958	0.8708552	-0.3658184	0.8888815	0.1815230

Summary

Spectral clustering works on a similarity graph and can capture structures that are not well modeled by spherical clusters.
The kernel scale \(\sigma\) is crucial; a small grid search with an internal index (e.g., silhouette) is a practical approach.
Compare against K-means to justify switching clustering families rather than forcing centroid-based methods.

A work by Gianluca Sottile

gianluca.sottile@unipa.it