Usage
Basic usage
The following example demonstrates how to use coreset_sc.CoresetSpectralClustering to generate a graph using the stochastic block model (SBM) and perform spectral clustering with coreset sampling.
Vertices are assumed to have self loops since edges represent similarity between vertices.
from coreset_sc import CoresetSpectralClustering, gen_sbm
from sklearn.metrics.cluster import adjusted_rand_score
# Generate a graph from the stochastic block model
n = 1000 # number of nodes per cluster
k = 50 # number of clusters
p = 0.5 # probability of an intra-cluster edge
q = (1.0 / n) / k # probability of an inter-cluster edge
# A is a sparse scipy CSR matrix of a symmetric adjacency graph
A, ground_truth_labels = gen_sbm(n, k, p, q)
coreset_ratio = 0.1 # fraction of the data to use for the coreset graph
csc = CoresetSpectralClustering(
num_clusters=k, coreset_ratio=coreset_ratio
)
csc.fit(A) # sample, extract, and cluster the coreset graph
csc.label_full_graph() # label the rest of the graph given the coreset labels
pred_labels = csc.labels_ # get the full labels
# Alternatively, label the full graph in one line:
pred_labels = csc.fit_predict(A)
ari = adjusted_rand_score(ground_truth_labels, pred_labels)
print(f"Adjusted Rand Index: {ari:.2f}")
Advanced usage
There are additional parameters that can be set in the constructor of coreset_sc.CoresetSpectralClustering.
This example shows demostrates how to swap out Kmeans for MiniBatchKMeans, shift the implicit kernel matrix by a constant *D^{-1}, use a custom over sampling factor for seeding the coreset distribution, only cluster the coreset graph, and turn off warnings about negative kernel distances triggering clipping.
from coreset_sc import CoresetSpectralClustering, gen_sbm
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.cluster import MiniBatchKMeans
# Generate a graph from the stochastic block model
n = 1000 # number of nodes per cluster
k = 50 # number of clusters
p = 0.5 # probability of an intra-cluster edge
q = (1.0 / n) / k # probability of an inter-cluster edge
# A is a sparse scipy CSR matrix of a symmetric adjacency graph
A, ground_truth_labels = gen_sbm(n, k, p, q)
coreset_ratio = 0.1 # fraction of the data to use for the coreset graph
csc = CoresetSpectralClustering(
num_clusters=k,
coreset_ratio=coreset_ratio,
k_over_sampling_factor=5.0 # increase the number of samples for seeding the coreset distribution
shift = 0.25, # shift the implicit kernel matrix by 0.25* D^{-1}
kmeans_alg=MiniBatchKMeans(n_clusters=k, batch_size=2048), # use MiniBatchKMeans instead of KMeans,
full_labels=False # only cluster the coreset graph
ignore_warnings=True # turn off warnings about negative kernel distances triggering clipping
)
csc.fit(A) # sample, extract, and cluster the coreset graph.
coreset_labels = csc.coreset_labels_ # get the coreset labels
csc.label_full_graph() # label the rest of the graph given the coreset labels
pred_labels = csc.labels_ # get the full labels
# Alternatively, label the full graph in one line (ignores full_labels=False):
pred_labels = csc.fit_predict(A)
ari = adjusted_rand_score(ground_truth_labels, pred_labels)
print(f"Adjusted Rand Index: {ari:.2f}")
Custom coreset clustering algorithms
The following snippet shows how to use a custom clustering algorithm for clustering the coreset graph.