API Reference

ntac package initialization.

class ntac.FAFBData(edges_file, types_file, ignore_side=False, target_locations=None, top_regions_summary_file=None)

Bases: GraphData

Class for loading FAFB dataset data, extending GraphData with additional attributes.

Loads graph data (edges), node types, and optionally a top regions summary from CSV files. Provides data filtering (e.g., by target locations) and extracts extra features such as node locations and top regions.

edges_file

Path to the CSV file containing edge data.

Type:

str

types_file

Path to the CSV file containing node type/label data.

Type:

str

ignore_side

Flag to determine whether to ignore the ‘side’ information when processing locations.

Type:

bool

target_locations

Specific locations to filter the nodes.

Type:

list or set, optional

top_regions_summary_file

Path to the CSV file for top regions summary.

Type:

str, optional

(Additional attributes such as adjacency matrices, ground truth partitions, features, and mappings are loaded by load_graph_and_partition().)

get_metrics(partition, indices, gt_labels, compute_class_acc=False)

Compute overall, top-k, region-level, and class-level accuracy metrics for FAFB data.

This function accepts either:
  • A “hard” partition as a 1-D array of length n (one label per node), or

  • A dictionary of the form {node_index: [(label₁, score₁), (label₂, score₂), …]}, as returned by nt.get_topk_partition(K).

Parameters:
  • partition (np.ndarray or dict) – If an array of shape (n,), each entry is a single predicted label. If a dict mapping node_index to a list of (label, score) pairs, this is interpreted as a top-k ranking for each node.

  • indices (array-like of int) – Subset of node indices over which to evaluate metrics (e.g., test set indices).

  • gt_labels (np.ndarray of shape (n,)) – Ground-truth labels for all nodes.

  • compute_class_acc (bool, optional) – If True, compute class-level top-k accuracy for each label in self.unique_labels. Default is False.

Returns:

A dictionary containing:

  • 'topk_acc': dict mapping k → overall accuracy@k over indices.

  • 'topk_region_acc': dict mapping region → list of [acc@1, acc@2, …, acc@K] (if locations exist).

  • 'region_acc': dict mapping region → accuracy@1 (derived from topk_region_acc).

  • 'topk_class_acc': dict mapping class_label → [acc@1, …, acc@K] (if compute_class_acc=True).

  • 'class_acc': dict mapping class_label → accuracy@1 (if compute_class_acc=True).

  • Any additional keys returned by super().get_metrics(…), such as 'acc', 'ari', or 'f1'.

Return type:

dict
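As a rough illustration of how accuracy@k can be read off the dictionary form of partition, the sketch below uses plain Python. The helper name topk_accuracy and the toy data are hypothetical and not part of ntac; the library's own computation may differ in its details.

```python
def topk_accuracy(partition, indices, gt_labels, max_k):
    """Return {k: accuracy@k} over `indices` for a top-k partition dict.

    Hypothetical helper for illustration only; not part of ntac.
    """
    hits = {k: 0 for k in range(1, max_k + 1)}
    for i in indices:
        ranked = [label for label, _score in partition[i]]
        for k in hits:
            # A node counts as correct at k if its true label is in the top k.
            if gt_labels[i] in ranked[:k]:
                hits[k] += 1
    total = len(indices)
    return {k: hits[k] / total for k in hits}


# Toy data: three nodes, each with two ranked (label, score) candidates.
partition = {
    0: [("A", 0.9), ("B", 0.1)],
    1: [("B", 0.6), ("A", 0.4)],
    2: [("A", 0.7), ("C", 0.3)],
}
gt_labels = ["A", "A", "B"]
acc = topk_accuracy(partition, [0, 1, 2], gt_labels, max_k=2)
print(acc)
```

Accuracy@k can only stay the same or grow as k increases, because the top-k list for a node always contains its top-(k-1) list.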

get_neuron_ids(indices)

Retrieve the neuron IDs for a given set of node indices.

Parameters:

indices (np.array) – Array of node indices for which to retrieve neuron IDs.

Returns:

List of neuron IDs corresponding to the provided indices.

Return type:

list

load_graph_and_partition()

Loads graph structure and node data from CSV files; processes and filters data as required.

This method performs the following:
  1. Loads edges and types from the provided CSV files.

  2. Builds the complete node set and a mapping from node names to indices.

  3. Constructs the initial adjacency matrix (CSR format).

  4. Initializes ground truth labels, features, and locations from the types file.

  5. Optionally filters nodes based on target locations if specified.

  6. Optionally loads a top regions summary file to build capacity mappings.

Returns:

A tuple containing the following elements, in order:
  • adj_csr (scipy.sparse.csr_matrix): Adjacency matrix for the graph.

  • ground_truth_partition (np.array): Array of node labels (using self.unlabeled_symbol for missing labels).

  • idx_to_node (dict): Mapping from new node indices to original node names.

  • features (np.array): Node features array (assumed to have 3 features per node).

  • locations (list): List of location strings for each node.

  • top_regions (list or None): List of top in/out region(s) for each node, if available.

  • n (int): Number of nodes after filtering.

  • top_regions_summary (dict or None): Summary data for top regions, if provided.

  • cluster_capacities (dict or None): Mapping of cluster types to their expected capacities.

Return type:

tuple
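Steps 2 and 3 above (node indexing and CSR construction) can be sketched in plain Python as follows. The (source, target, weight) row layout, the helper name build_csr, and the choice to sum duplicate edges are assumptions made for illustration; the actual CSV columns and handling in ntac are not specified here.

```python
def build_csr(edges):
    """Build CSR arrays and a name -> index map from (source, target, weight) rows.

    Hypothetical helper; the edge tuple layout is an assumption.
    """
    # Step 2: complete node set and mapping from node names to indices.
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})
    node_to_idx = {name: i for i, name in enumerate(nodes)}

    # Step 3: collect weighted neighbours per source row.
    rows = [{} for _ in nodes]
    for u, v, w in edges:
        r, c = node_to_idx[u], node_to_idx[v]
        rows[r][c] = rows[r].get(c, 0) + w  # duplicate edges are summed

    # Flatten into the three CSR arrays: row pointers, column indices, values.
    indptr, indices, data = [0], [], []
    for row in rows:
        for c in sorted(row):
            indices.append(c)
            data.append(row[c])
        indptr.append(len(indices))
    return indptr, indices, data, node_to_idx


edges = [("n1", "n2", 3), ("n1", "n3", 1), ("n3", "n1", 2)]
indptr, indices, data, node_to_idx = build_csr(edges)
idx_to_node = {i: name for name, i in node_to_idx.items()}
print(indptr, indices, data)
```

The three lists have the layout that scipy.sparse.csr_matrix accepts in its (data, indices, indptr) constructor form, so they could be handed to SciPy directly.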

class ntac.GraphData(adj_csr, labels)

Bases: object

Class for storing graph data and computing evaluation metrics.

adj_csr

The adjacency matrix in CSR format.

Type:

scipy.sparse.csr_matrix

adj_csc

The adjacency matrix converted to CSC format.

Type:

scipy.sparse.csc_matrix

labels

Array of labels (ground truth partitions) for each node.

Type:

np.array

n

The number of nodes in the graph.

Type:

int

unlabeled_symbol

Symbol used for nodes that have no label.

Type:

str

labeled_nodes

Indices of nodes that are labeled.

Type:

np.array

unique_labels

Unique set of labels found in the graph.

Type:

np.array

get_metrics(partition, indices, gt_labels, map_labels=False)

Calculate evaluation metrics over the specified node indices.

Computes the Adjusted Rand Index (ARI), macro F1 score, and accuracy of the given partition against the ground truth labels, restricted to the specified indices.

Parameters:
  • partition (np.array) – Array of predicted labels for each node.

  • indices (np.array) – Indices of the nodes to evaluate.

  • gt_labels (np.array) – Array of ground truth labels for nodes.

Returns:

A dictionary with metric names as keys and the corresponding scores:

  • 'ari': Adjusted Rand Index

  • 'f1': macro F1 score

  • 'acc': classification accuracy

Return type:

dict
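For readers unfamiliar with the three metrics, the following self-contained sketch computes them from their standard definitions on a toy example. The helper names are hypothetical, and averaging macro F1 over the union of true and predicted classes is an assumed convention; ntac may delegate to a library and differ in edge cases.

```python
from collections import Counter
from math import comb


def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def macro_f1(pred, true):
    # Unweighted mean of per-class F1 over every class seen in pred or true.
    scores = []
    for c in sorted(set(true) | set(pred)):
        tp = sum(p == c and t == c for p, t in zip(pred, true))
        fp = sum(p == c and t != c for p, t in zip(pred, true))
        fn = sum(p != c and t == c for p, t in zip(pred, true))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def adjusted_rand_index(pred, true):
    # Pair-counting form of ARI built from the contingency table.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(pred, true)).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred).values())
    sum_true = sum(comb(v, 2) for v in Counter(true).values())
    expected = sum_pred * sum_true / comb(len(true), 2)
    max_index = (sum_pred + sum_true) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


gt_labels = ["A", "A", "B", "B", "C"]
partition = ["A", "B", "B", "B", "C"]
indices = [0, 1, 2, 3]  # evaluate on a subset only

pred = [partition[i] for i in indices]
true = [gt_labels[i] for i in indices]
metrics = {
    "ari": adjusted_rand_index(pred, true),
    "f1": macro_f1(pred, true),
    "acc": accuracy(pred, true),
}
print(metrics)
```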

test_train_split(train_size=0.1, sampling_type='at_least_one_per_class', random_seed=None)

Split the labeled nodes into training and test sets using various sampling strategies.

Parameters:
  • train_size (float or int) – If < 1, the fraction of total labeled nodes to use as training; if ≥ 1, the absolute number of training samples.

  • sampling_type (str) – Type of sampling to use. One of:

      • “uniform”: Random sampling from all labeled nodes.

      • “at_least_one_per_class”: Ensure at least one sample per class, then randomly fill the remaining slots. This may exceed train_size if needed to satisfy the class constraint.

      • “exactly_k_per_class”: Sample exactly train_size nodes per class.

      • “stratified”: Sample nodes so that the class distribution in the training set mirrors the overall distribution.

  • num_per_class (int, optional) – Number of samples per class, used only in “exactly_k_per_class” mode.

Returns:

  • train_set (np.ndarray) – Array of indices for training nodes.

  • test_set (np.ndarray) – Array of indices for test nodes.
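The “at_least_one_per_class” strategy can be pictured with the sketch below, written in plain Python. The function name, the "?" unlabeled marker, and the exact fill order are assumptions for illustration and do not reproduce the ntac implementation.

```python
import random
from collections import defaultdict


def split_at_least_one_per_class(labels, labeled_nodes, train_size, seed=None):
    """Hypothetical sketch of the 'at_least_one_per_class' strategy."""
    rng = random.Random(seed)
    # Fraction if < 1, absolute count otherwise (as described for train_size).
    if train_size < 1:
        target = int(train_size * len(labeled_nodes))
    else:
        target = int(train_size)

    by_class = defaultdict(list)
    for i in labeled_nodes:
        by_class[labels[i]].append(i)

    # One guaranteed sample per class.
    train = {rng.choice(members) for members in by_class.values()}

    # Fill any remaining slots uniformly at random. If there are more
    # classes than `target`, the training set simply exceeds `target`.
    remaining = [i for i in labeled_nodes if i not in train]
    extra = max(0, target - len(train))
    train.update(rng.sample(remaining, min(extra, len(remaining))))

    test = [i for i in labeled_nodes if i not in train]
    return sorted(train), test


# Toy data: "?" marks unlabeled nodes (an assumed symbol).
labels = ["A"] * 6 + ["B"] * 3 + ["C"] + ["?"] * 2
labeled_nodes = list(range(10))
train_set, test_set = split_at_least_one_per_class(
    labels, labeled_nodes, train_size=0.5, seed=0
)
print(train_set, test_set)
```

With train_size=2 on the same data, the sketch returns three training nodes, one per class, which illustrates how the class constraint can push the training set past the requested size.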

class ntac.Ntac(data, labels=None, lr=0.3, topk=1, verbose=False)

Bases: object

get_partition()

Returns the current partition after remapping the numeric labels back to the original labels.

Returns:

Array of labels corresponding to the current partition.

Return type:

np.ndarray

get_topk_partition(k=None)

Returns the top-k partition labels and their similarity scores for each node, formatted as a dict mapping node_index → list of (label, similarity) tuples.

Parameters:

k (int) – Number of top labels to return.

Returns:

A dictionary where each key is a node index (0…n-1) and each value is a list of k tuples (original_label, similarity_score), sorted by descending score.

Return type:

Dict[int, List[Tuple[label, float]]]
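To make the return format concrete, the sketch below builds a dictionary of the same shape from per-node similarity scores and then collapses it to a hard partition. The input layout (one {label: similarity} dict per node) and the helper name are hypothetical; ntac derives its scores internally.

```python
def topk_from_scores(scores, k):
    """scores: one {label: similarity} dict per node (assumed layout)."""
    result = {}
    for node, label_scores in enumerate(scores):
        # Sort candidate labels by descending similarity and keep the best k.
        ranked = sorted(label_scores.items(), key=lambda item: item[1], reverse=True)
        result[node] = ranked[:k]
    return result


scores = [
    {"A": 0.2, "B": 0.7, "C": 0.1},
    {"A": 0.5, "B": 0.3, "C": 0.2},
]
topk = topk_from_scores(scores, k=2)

# Taking the first entry of each list yields a hard, one-label-per-node partition.
hard = [entries[0][0] for _node, entries in sorted(topk.items())]
print(topk, hard)
```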

map_partition_to_gt_labels(gt_labels)

Given:
  • partition: 1D array of predicted labels (strings), length N

  • self.labels: 1D array of ground-truth labels (strings), length N

This uses only util functions (labels2clusters, match_clusters, cluster_labels) to align predicted clusters to ground-truth clusters and return a 1D array of matched GT‐label strings.

solve_unseeded(max_k, center_size=5, output_name=None, info_step=1, max_iterations=12, frac_seeds=0.1, chunk_size=6000)

step()

Performs one step of the seeded ntac algorithm.

This includes:
  1. Updating the partition from the similarity matrix.

  2. Computing the incremental embedding update for nodes whose partitions have changed.

  3. Blending the updated embedding with the previous state using the learning rate.

  4. Recomputing the similarity matrix to frozen nodes.

Also updates internal timing metrics for different operations.
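One common way to blend a new state with a previous one using a learning rate is a convex combination, (1 - lr) * previous + lr * updated. The sketch below shows that form on small lists; whether ntac uses exactly this formula in step 3 is an assumption, and the helper name is hypothetical.

```python
def blend(previous, updated, lr):
    """Convex blend of two embeddings given as lists of rows.

    Assumed form of the blending step; not taken from the ntac source.
    """
    return [
        [(1 - lr) * p + lr * u for p, u in zip(prev_row, upd_row)]
        for prev_row, upd_row in zip(previous, updated)
    ]


previous = [[1.0, 0.0], [0.0, 2.0]]
updated = [[0.0, 1.0], [0.0, 4.0]]
blended = blend(previous, updated, lr=0.3)
print(blended)
```

Under this form, lr=1 discards the previous embedding entirely and lr=0 freezes it, so the default lr=0.3 in the Ntac constructor would favour the previous state.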

class ntac.Visualizer(nt, data)

Bases: object

plot_acc_vs_class_size(metrics, bins=None, test_indices=None)

Plot class accuracy (y) vs. class size (x).

Parameters:
  • metrics – dict containing “class_acc”: {class_label: accuracy}.

  • bins – if provided, a sequence of bin edges to group class sizes; will plot mean accuracy per bin as bars.

  • test_indices – optional list/array of node indices. If given, only classes present in test_indices are shown, and class sizes are computed on that subset.
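The binning behaviour described for bins can be sketched without any plotting. The helper below groups classes by size and averages their accuracies per bin; the half-open bin convention (left edge included, right edge excluded) and the helper name are assumptions, and the plotting method may treat edges differently.

```python
from bisect import bisect_right
from collections import Counter


def mean_acc_per_bin(class_acc, labels, bins):
    """Average class accuracy within each class-size bin (hypothetical helper)."""
    sizes = Counter(labels)
    grouped = {}
    for cls, acc in class_acc.items():
        # Bin b covers bins[b] <= size < bins[b + 1].
        b = bisect_right(bins, sizes[cls]) - 1
        if 0 <= b < len(bins) - 1:
            grouped.setdefault(b, []).append(acc)
    return {
        (bins[b], bins[b + 1]): sum(values) / len(values)
        for b, values in sorted(grouped.items())
    }


labels = ["A"] * 2 + ["B"] * 3 + ["C"] * 12
class_acc = {"A": 0.5, "B": 1.0, "C": 0.9}
binned = mean_acc_per_bin(class_acc, labels, bins=[1, 5, 20])
print(binned)
```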

plot_class_accuracy(metrics)

Plot a horizontal bar chart of neuron class accuracies.

Parameters:

metrics – Dictionary that contains key “class_acc”, a mapping of class -> accuracy.

plot_confusion_matrix(labels=None, normalize=False, cmap='Blues', fscore_threshold=1.0, include_labels=None)

Plot a confusion matrix comparing true and predicted class labels.

Parameters:
  • labels – list of all possible classes (for ordering). If None, derived from union of true and predicted.

  • normalize – if True, normalize each row to sum to 1.

  • cmap – colormap for the matrix.

  • fscore_threshold – show only labels whose F1 score is below this threshold.

  • include_labels – optional list or set of labels to include regardless of F1; used to restrict final plotted labels.
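The effect of labels and normalize can be seen in a small pure-Python confusion matrix, shown below. This is an illustrative sketch with a hypothetical helper name; it leaves out the fscore_threshold and include_labels filtering and all plotting.

```python
def confusion_matrix(true, pred, labels=None, normalize=False):
    """Rows are true classes, columns are predicted classes."""
    if labels is None:
        # Derive the ordering from the union of true and predicted labels.
        labels = sorted(set(true) | set(pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true, pred):
        if t in index and p in index:
            matrix[index[t]][index[p]] += 1
    if normalize:
        # Divide each row by its sum so that non-empty rows sum to 1.
        matrix = [
            [cell / sum(row) if sum(row) else 0.0 for cell in row]
            for row in matrix
        ]
    return labels, matrix


true = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
order, counts = confusion_matrix(true, pred)
_, fractions = confusion_matrix(true, pred, normalize=True)
print(order, counts, fractions)
```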

plot_embedding_comparison(class_name1, class_name2, show_error=False, use_gt_labels=False, min_threshold=0.0)

Plot a back-to-back horizontal bar chart comparing in- and out-degree embeddings for two classes. If the two class names are the same, the bars will align exactly.

Assumes:
  • self.nt.get_partition() returns predicted labels.

  • self.nt.embedding is (n_neurons, embedding_dim) with the first half corresponding to in-degree and the second half to out-degree.

  • self.nt.reverse_mapping maps bin indices (0 to half_dim-1) to type names.

plot_true_label_histogram(alg_class, top_k=None)

Plot a histogram of the ground truth labels among neurons that are assigned to a given predicted class (alg_class). If top_k is provided, only the top k most common labels will be shown.

Assumes:
  • self.nt.get_partition() returns predicted class labels.

  • self.data.labels contains the ground truth labels.
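The counts behind such a histogram can be computed with collections.Counter, as in the sketch below. The helper name and toy labels are hypothetical; the method itself reads the predicted labels from self.nt.get_partition() and the ground truth from self.data.labels.

```python
from collections import Counter


def true_label_counts(partition, gt_labels, alg_class, top_k=None):
    """Count ground-truth labels among nodes predicted as `alg_class`."""
    counts = Counter(
        gt for pred, gt in zip(partition, gt_labels) if pred == alg_class
    )
    # most_common(None) returns every label, ordered by descending count.
    return counts.most_common(top_k)


partition = ["X", "X", "X", "Y", "X"]
gt_labels = ["A", "A", "B", "A", "C"]
all_counts = true_label_counts(partition, gt_labels, "X")
top_one = true_label_counts(partition, gt_labels, "X", top_k=1)
print(all_counts, top_one)
```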

ntac.download_flywire_data(data_url='https://github.com/BenJourdan/ntac/releases/download/v0.1.0/dynamic_data.zip', cache_dir='/home/runner/.ntac', zip_path=None, data_dir=None, verbose=False)

Download and unzip the FlyWire data if not already cached.

Parameters:
  • data_url (str) – URL to download the FlyWire data from.

  • cache_dir (str) – Directory to cache the downloaded data.

  • zip_path (str) – Path to the zip file.

  • data_dir (str) – Directory to extract the data to.

  • verbose (bool) – If True, print progress messages.

ntac.sbm(n, k, p_in_range=(0.1, 0.9), p_out_range=(0.1, 0.9), seed=None)

Generate a synthetic undirected graph using the Stochastic Block Model (SBM).

This function creates an adjacency matrix and block labels using heterogeneous within-block (p_in_range) and between-block (p_out_range) connection probabilities.

Parameters:
  • n (int) – Total number of nodes.

  • k (int) – Number of blocks (clusters).

  • p_in_range (tuple of float) – Range from which to sample within-block probabilities.

  • p_out_range (tuple of float) – Range from which to sample between-block probabilities.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

  • A (np.ndarray) – Adjacency matrix of the generated graph (symmetric, unweighted).

  • labels (np.ndarray) – Array of block assignments for each node.
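A minimal pure-Python sketch of a stochastic block model with heterogeneous probabilities is given below. It assumes round-robin block assignment and one uniformly sampled probability per block pair; these choices, and the name sbm_sketch, are illustrative and may not match how ntac.sbm assigns blocks or samples probabilities.

```python
import random


def sbm_sketch(n, k, p_in_range=(0.1, 0.9), p_out_range=(0.1, 0.9), seed=None):
    """Generate a symmetric 0/1 adjacency matrix and block labels (sketch)."""
    rng = random.Random(seed)
    labels = [i % k for i in range(n)]  # assumption: round-robin blocks

    # One connection probability per unordered pair of blocks.
    probs = {}
    for a in range(k):
        for b in range(a, k):
            low, high = p_in_range if a == b else p_out_range
            probs[(a, b)] = rng.uniform(low, high)

    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, no self-loops
            a, b = sorted((labels[i], labels[j]))
            if rng.random() < probs[(a, b)]:
                A[i][j] = A[j][i] = 1  # mirror to keep the graph undirected
    return A, labels


A, labels = sbm_sketch(12, 3, seed=0)
print(sum(map(sum, A)) // 2, "edges")
```

Because both ranges default to (0.1, 0.9), blocks are not guaranteed to be denser inside than between; passing a higher p_in_range and a lower p_out_range gives more clearly separated communities.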