Usage ===== Basic usage ----------- The following example demonstrates how to use :class:`coreset_sc.CoresetSpectralClustering` to generate a graph using the stochastic block model (SBM) and perform spectral clustering with coreset sampling. Vertices are assumed to have self loops since edges represent similarity between vertices. .. code-block:: python from coreset_sc import CoresetSpectralClustering, gen_sbm from sklearn.metrics.cluster import adjusted_rand_score # Generate a graph from the stochastic block model n = 1000 # number of nodes per cluster k = 50 # number of clusters p = 0.5 # probability of an intra-cluster edge q = (1.0 / n) / k # probability of an inter-cluster edge # A is a sparse scipy CSR matrix of a symmetric adjacency graph A, ground_truth_labels = gen_sbm(n, k, p, q) coreset_ratio = 0.1 # fraction of the data to use for the coreset graph csc = CoresetSpectralClustering( num_clusters=k, coreset_ratio=coreset_ratio ) csc.fit(A) # sample, extract, and cluster the coreset graph csc.label_full_graph() # label the rest of the graph given the coreset labels pred_labels = csc.labels_ # get the full labels # Alternatively, label the full graph in one line: pred_labels = csc.fit_predict(A) ari = adjusted_rand_score(ground_truth_labels, pred_labels) print(f"Adjusted Rand Index: {ari:.2f}") Advanced usage --------------- There are additional parameters that can be set in the constructor of :class:`coreset_sc.CoresetSpectralClustering`. This example shows demostrates how to swap out Kmeans for MiniBatchKMeans, shift the implicit kernel matrix by a constant \*D^{-1}, use a custom over sampling factor for seeding the coreset distribution, only cluster the coreset graph, and turn off warnings about negative kernel distances triggering clipping. .. code-block:: python from coreset_sc import CoresetSpectralClustering, gen_sbm from sklearn.metrics.cluster import adjusted_rand_score from sklearn.cluster import MiniBatchKMeans # Generate a graph from the stochastic block model n = 1000 # number of nodes per cluster k = 50 # number of clusters p = 0.5 # probability of an intra-cluster edge q = (1.0 / n) / k # probability of an inter-cluster edge # A is a sparse scipy CSR matrix of a symmetric adjacency graph A, ground_truth_labels = gen_sbm(n, k, p, q) coreset_ratio = 0.1 # fraction of the data to use for the coreset graph csc = CoresetSpectralClustering( num_clusters=k, coreset_ratio=coreset_ratio, k_over_sampling_factor=5.0 # increase the number of samples for seeding the coreset distribution shift = 0.25, # shift the implicit kernel matrix by 0.25* D^{-1} kmeans_alg=MiniBatchKMeans(n_clusters=k, batch_size=2048), # use MiniBatchKMeans instead of KMeans, full_labels=False # only cluster the coreset graph ignore_warnings=True # turn off warnings about negative kernel distances triggering clipping ) csc.fit(A) # sample, extract, and cluster the coreset graph. coreset_labels = csc.coreset_labels_ # get the coreset labels csc.label_full_graph() # label the rest of the graph given the coreset labels pred_labels = csc.labels_ # get the full labels # Alternatively, label the full graph in one line (ignores full_labels=False): pred_labels = csc.fit_predict(A) ari = adjusted_rand_score(ground_truth_labels, pred_labels) print(f"Adjusted Rand Index: {ari:.2f}") Custom coreset clustering algorithms ------------------------------------ The following snippet shows how to use a custom clustering algorithm for clustering the coreset graph. .. code-block:: python from coreset_sc import CoresetSpectralClustering, gen_sbm from sklearn.metrics.cluster import adjusted_rand_score from sklearn.cluster import SpectralClustering # Generate a graph from the stochastic block model n = 1000 # number of nodes per cluster k = 50 # number of clusters p = 0.5 # probability of an intra-cluster edge q = (1.0 / n) / k # probability of an inter-cluster edge # A is a sparse scipy CSR matrix of a symmetric adjacency graph A, ground_truth_labels = gen_sbm(n, k, p, q) coreset_ratio = 0.1 # fraction of the data to use for the coreset graph csc = CoresetSpectralClustering( num_clusters=k, # required coreset_ratio=coreset_ratio, # Optional parameters: k_over_sampling_factor=2.0, shift=0.01, ) coreset_graph = csc.get_coreset_graph(A) sc = SpectralClustering( n_clusters=k, affinity='precomputed', random_state=42, ) coreset_labels = sc.fit_predict(coreset_graph) csc.set_coreset_graph_labels(coreset_labels) # Now label the full graph using the coreset labels csc.label_full_graph() pred_labels = csc.labels_ ari = adjusted_rand_score(ground_truth_labels, pred_labels) print(ari)