# TOBA
**Repository Path**: nortonii/TOBA
## Basic Information
- **Project Name**: TOBA
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-10-27
- **Last Updated**: 2023-10-27
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README

Topological Augmentation for Class-Imbalanced Node Classification
**ToBA** (Topological Balanced Augmentation) is a **lightweight, plug-and-play** graph data augmentation technique for class-imbalanced node classification tasks. It aims to mitigate the **class-imbalance bias** introduced by **ambivalent and distant message-passing** by dynamically identifying and rectifying nodes that are exposed to such issues. Our **ToBA** implementation features:
- 🍎 **Plug-and-play**: model-agnostic augmentation that directly integrates into the training loop.
- 🍎 **Effectiveness**: boosts classification performance while reducing predictive bias.
- 🍎 **Versatility**: works with various GNN architectures and imbalance-handling techniques.
- 🍎 **Lightweight**: minimal computational overhead and no additional hyperparameters.
- 🍎 **Ease-of-use**: unified, concise, and extensible API design.
Integrating [`TopoBalanceAugmenter`](https://github.com/AnonAuthorAI/ToBA/blob/main/toba.py#L170) (**ToBA**) into your training loop takes fewer than 5 lines of code:
```python
from toba import TopoBalanceAugmenter

augmenter = TopoBalanceAugmenter().init_with_data(data)

for epoch in range(epochs):
    # augment the graph
    x, edge_index, _ = augmenter.augment(model, x, edge_index)
    y, train_mask = augmenter.adapt_labels_and_train_mask(y, train_mask)
    # original training step
    model.update(x, y, edge_index, train_mask)
```
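The snippet above assumes that `data`, `x`, `edge_index`, `y`, `train_mask`, `model`, and `epochs` are already in scope. For context, here is one way those variables might be prepared, assuming a standard PyTorch Geometric setup (the dataset choice and root path below are illustrative, not prescribed by ToBA):

```python
# Illustrative setup (assumes PyTorch Geometric is installed);
# any PyG node-classification dataset would work the same way.
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="./data", name="Cora")
data = dataset[0]                          # PyG data object expected by init_with_data

x, edge_index = data.x, data.edge_index    # node features, sparse edge index
y, train_mask = data.y, data.train_mask    # labels and training mask
epochs = 2000                              # matches train.py's default --epochs
```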
### Table of Contents
- [Usage](#usage)
- [Command line](#command-line)
- [Jupyter Notebook](#jupyter-notebook)
- [API reference](#api-reference)
- [TopoBalanceAugmenter](#topobalanceaugmenter)
- [NodeClassificationTrainer](#nodeclassificationtrainer)
- [Empirical Results](#empirical-results)
- [Experimental Setup](#experimental-setup)
- [On the effectiveness and versatility of TOBA](#on-the-effectiveness-and-versatility-of-toba)
- [On the robustness of TOBA](#on-the-robustness-of-toba)
- [On mitigating AMP and DMP](#on-mitigating-amp-and-dmp)
- [References](#references)
## Usage
### Command line
[`train.py`](https://github.com/AnonAuthorAI/ToBA/blob/main/train.py) provides a simple way to test ToBA under different settings: datasets, imbalance types, imbalance ratios, GNN architectures, etc. For example, to test ToBA's effectiveness on the Cora dataset with a 10:1 step imbalance ratio using the GCN architecture, simply run:
```bash
python train.py --dataset cora --imb_type step --imb_ratio 10 --gnn_arch GCN --toba_mode all
```
Output:
```
================= Dataset [Cora] - StepIR [10] - ToBA [dummy] =================
Best Epoch: 97 | train/val/test | ACC: 100.0/67.20/67.50 | BACC: 100.0/61.93/60.55 | MACRO-F1: 100.0/59.65/59.29 | upd/aug time: 4.67/0.00ms | node/edge ratio: 100.00/100.00%
Best Epoch: 67 | train/val/test | ACC: 100.0/65.20/65.00 | BACC: 100.0/60.04/57.70 | MACRO-F1: 100.0/57.21/55.09 | upd/aug time: 3.36/0.00ms | node/edge ratio: 100.00/100.00%
Best Epoch: 131 | train/val/test | ACC: 100.0/66.80/67.90 | BACC: 100.0/63.78/61.71 | MACRO-F1: 100.0/62.26/60.08 | upd/aug time: 3.37/0.00ms | node/edge ratio: 100.00/100.00%
Best Epoch: 60 | train/val/test | ACC: 100.0/66.40/66.30 | BACC: 100.0/61.60/60.74 | MACRO-F1: 100.0/58.04/59.09 | upd/aug time: 3.34/0.00ms | node/edge ratio: 100.00/100.00%
Best Epoch: 151 | train/val/test | ACC: 100.0/63.40/63.70 | BACC: 100.0/58.00/55.99 | MACRO-F1: 100.0/53.57/51.88 | upd/aug time: 3.19/0.00ms | node/edge ratio: 100.00/100.00%
Avg Test Performance (5 runs): | ACC: 66.08 ± 0.70 | BACC: 59.34 ± 0.96 | MACRO-F1: 57.09 ± 1.40
================== Dataset [Cora] - StepIR [10] - ToBA [pred] ==================
Best Epoch: 95 | train/val/test | ACC: 100.0/64.80/63.70 | BACC: 100.0/63.14/60.69 | MACRO-F1: 100.0/60.22/58.30 | upd/aug time: 3.48/3.58ms | node/edge ratio: 100.26/103.05%
Best Epoch: 157 | train/val/test | ACC: 100.0/71.80/69.70 | BACC: 100.0/71.59/68.44 | MACRO-F1: 100.0/69.45/66.74 | upd/aug time: 3.36/3.64ms | node/edge ratio: 100.26/103.19%
Best Epoch: 177 | train/val/test | ACC: 100.0/73.40/73.20 | BACC: 100.0/73.27/71.69 | MACRO-F1: 100.0/71.31/70.53 | upd/aug time: 3.34/3.64ms | node/edge ratio: 100.26/102.89%
Best Epoch: 340 | train/val/test | ACC: 100.0/70.20/73.00 | BACC: 100.0/65.76/67.88 | MACRO-F1: 100.0/64.42/67.45 | upd/aug time: 3.41/3.84ms | node/edge ratio: 100.26/103.13%
Best Epoch: 90 | train/val/test | ACC: 100.0/66.60/67.30 | BACC: 100.0/61.18/59.96 | MACRO-F1: 100.0/58.85/58.07 | upd/aug time: 3.19/3.65ms | node/edge ratio: 100.26/103.23%
Avg Test Performance (5 runs): | ACC: 69.38 ± 1.60 | BACC: 65.73 ± 2.06 | MACRO-F1: 64.22 ± 2.28
================== Dataset [Cora] - StepIR [10] - ToBA [topo] ==================
Best Epoch: 72 | train/val/test | ACC: 100.0/72.00/72.20 | BACC: 100.0/69.65/68.93 | MACRO-F1: 100.0/66.88/67.10 | upd/aug time: 3.12/4.10ms | node/edge ratio: 100.26/101.43%
Best Epoch: 263 | train/val/test | ACC: 100.0/72.80/71.70 | BACC: 100.0/72.59/69.01 | MACRO-F1: 100.0/72.05/68.70 | upd/aug time: 3.51/4.10ms | node/edge ratio: 100.26/101.75%
Best Epoch: 186 | train/val/test | ACC: 100.0/74.00/73.70 | BACC: 100.0/74.37/73.10 | MACRO-F1: 100.0/71.61/71.04 | upd/aug time: 3.36/4.15ms | node/edge ratio: 100.26/101.56%
Best Epoch: 71 | train/val/test | ACC: 100.0/72.40/72.10 | BACC: 100.0/69.50/67.75 | MACRO-F1: 100.0/68.11/66.80 | upd/aug time: 3.31/4.12ms | node/edge ratio: 100.26/101.55%
Best Epoch: 77 | train/val/test | ACC: 100.0/76.20/77.60 | BACC: 100.0/78.03/77.92 | MACRO-F1: 100.0/75.06/76.42 | upd/aug time: 3.34/4.10ms | node/edge ratio: 100.26/101.58%
Avg Test Performance (5 runs): | ACC: 73.46 ± 0.97 | BACC: 71.34 ± 1.68 | MACRO-F1: 70.01 ± 1.58
```
Full argument list and descriptions are as follows:
```
--gpu_id | int, default=0
    Specify which GPU to use for training. Set to -1 to use the CPU.
--seed | int, default=42
    Random seed for reproducibility in training.
--n_runs | int, default=5
    The number of independent runs for training.
--debug | bool, default=False
    Enable debug mode if set to True.
--dataset | str, default="cora"
    Name of the dataset to use for training.
    Supports "cora", "citeseer", "pubmed", "cs", and "physics".
--imb_type | str, default="step", choices=["step", "natural"]
    Type of imbalance to introduce in the dataset. Choose from "step" or "natural".
--imb_ratio | int, default=10
    Imbalance ratio used to construct the imbalanced dataset.
--gnn_arch | str, default="GCN", choices=["GCN", "GAT", "SAGE"]
    Graph neural network architecture to use. Choose from "GCN", "GAT", or "SAGE".
--n_layer | int, default=3
    The number of layers in the GNN architecture.
--hid_dim | int, default=256
    Hidden dimension size for the GNN layers.
--lr | float, default=0.01
    Initial learning rate for training.
--weight_decay | float, default=5e-4
    Weight decay for regularization during training.
--epochs | int, default=2000
    The number of training epochs.
--early_stop | int, default=200
    Patience for early stopping during training.
--tqdm | bool, default=False
    Enable a tqdm progress bar during training if set to True.
--toba_mode | str, default="all", choices=["dummy", "pred", "topo", "all"]
    Mode of ToBA. Choose from "dummy", "pred", "topo", or "all".
    If "dummy", ToBA is disabled.
    If "pred", ToBA is enabled with only prediction-based augmentation.
    If "topo", ToBA is enabled with only topology-based augmentation.
    If "all", all modes are run and their results reported for comparison.
```
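Since every setting is exposed as a flag, sweeping configurations is straightforward. Below is a hypothetical sweep script that uses only the documented arguments (the chosen values are examples, not recommendations):

```python
# Hypothetical helper script: sweep several documented settings of train.py.
import itertools
import subprocess

datasets = ["cora", "citeseer", "pubmed"]
imb_settings = [("step", 10), ("step", 20), ("natural", 50), ("natural", 100)]

for dataset, (imb_type, imb_ratio) in itertools.product(datasets, imb_settings):
    subprocess.run([
        "python", "train.py",
        "--dataset", dataset,
        "--imb_type", imb_type,
        "--imb_ratio", str(imb_ratio),
        "--gnn_arch", "GCN",
        "--toba_mode", "all",
    ], check=True)
```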
### Jupyter Notebook
We also provide an example Jupyter notebook [train_example.ipynb](https://github.com/AnonAuthorAI/ToBA/blob/main/train_example.ipynb) with experimental results on:
- Datasets: ['cora', 'citeseer', 'pubmed']
- ToBA modes: ['dummy', 'pred', 'topo']
- Imbalance types:
- 'step': [10, 20]
- 'natural': [50, 100]
## API reference
### TopoBalanceAugmenter
https://github.com/AnonAuthorAI/ToBA/blob/main/toba.py#L170
Main class that implements the ToBA augmentation algorithm, inheriting from [`BaseGraphAugmenter`](https://github.com/AnonAuthorAI/ToBA/blob/main/toba.py#L11). It carries out the 3 core steps of ToBA:
- (1) node risk estimation
- (2) candidate class selection
- (3) virtual topology augmentation.
```python
class TopoBalanceAugmenter(BaseGraphAugmenter):
    """
    Topological Balanced Augmentation (ToBA) for graph data.

    Parameters:
    - mode: str, optional (default: "pred")
        The augmentation mode. Must be one of ["dummy", "pred", "topo"].
    - random_state: int or None, optional (default: None)
        Random seed for reproducibility.
    """
```
Core methods:
- `init_with_data(data)`: initialize the augmenter with graph data.
- Parameters:
- `data` : PyG data object
- Return:
- `self` : TopoBalanceAugmenter
- `augment(model, x, edge_index)`: perform topology-aware graph augmentation.
- Parameters:
- `model` : torch.nn.Module, node classification model
- `x` : torch.Tensor, node feature matrix
- `edge_index` : torch.Tensor, sparse edge index
- Return:
- `x_aug` : torch.Tensor, augmented node feature matrix
- `edge_index_aug`: torch.Tensor, augmented sparse edge index
- `info` : dict, augmentation info
- `adapt_labels_and_train_mask(y, train_mask)`: adapt labels and training mask after augmentation.
- Parameters:
- `y` : torch.Tensor, node label vector
- `train_mask` : torch.Tensor, training mask
- Return:
- `new_y` : torch.Tensor, adapted node label vector
- `new_train_mask` : torch.Tensor, adapted training mask
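Putting these three methods together, the sketch below trains a plain two-layer GCN with ToBA enabled. Only the augmenter calls follow the API documented above; the GCN definition and the manual optimization loop are illustrative choices of ours (ToBA itself is model-agnostic), and assume the `dataset`/`data` objects from a standard PyG dataset:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

from toba import TopoBalanceAugmenter

class GCN(torch.nn.Module):
    """A plain two-layer GCN, used here only for illustration."""
    def __init__(self, in_dim, hid_dim, n_class):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, n_class)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

model = GCN(dataset.num_features, 256, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
augmenter = TopoBalanceAugmenter(mode="pred").init_with_data(data)

for epoch in range(200):
    # steps (1)-(3): risk estimation, candidate selection, virtual augmentation
    x_aug, edge_index_aug, _ = augmenter.augment(model, data.x, data.edge_index)
    y_aug, mask_aug = augmenter.adapt_labels_and_train_mask(data.y, data.train_mask)

    # standard supervised training step on the augmented graph
    model.train()
    optimizer.zero_grad()
    out = model(x_aug, edge_index_aug)
    loss = F.cross_entropy(out[mask_aug], y_aug[mask_aug])
    loss.backward()
    optimizer.step()
```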
### NodeClassificationTrainer
https://github.com/AnonAuthorAI/ToBA/blob/main/trainer.py#L14
Trainer class for node classification tasks, centralizing the training workflow:
- (1) model preparation and selection
- (2) performance evaluation
- (3) data augmentation
- (4) verbose logging.
```python
class NodeClassificationTrainer:
    """
    A trainer class for node classification with Graph Augmenter.

    Parameters:
    - model: torch.nn.Module
        The node classification model.
    - data: pyg.data.Data
        PyTorch Geometric data object containing graph data.
    - device: str or torch.device
        Device to use for computations (e.g., 'cuda' or 'cpu').
    - augmenter: BaseGraphAugmenter, optional
        Graph augmentation strategy.
    - learning_rate: float, optional
        Learning rate for optimization.
    - weight_decay: float, optional
        Weight decay (L2 penalty) for optimization.
    - train_epoch: int, optional
        Number of training epochs.
    - early_stop_patience: int, optional
        Number of epochs with no improvement to trigger early stopping.
    - eval_freq: int, optional
        Frequency of evaluation during training.
    - eval_metrics: dict, optional
        Dictionary of evaluation metrics and associated functions.
    - verbose_freq: int, optional
        Frequency of verbose logging.
    - verbose_config: dict, optional
        Configuration for verbose logging.
    - save_model_dir: str, optional
        Directory to save model checkpoints.
    - save_model_name: str, optional
        Name of the saved model checkpoint.
    - enable_tqdm: bool, optional
        Whether to enable tqdm progress bar.
    - random_state: int, optional
        Seed for random number generator.
    """
```
Core methods:
- `train`: train the node classification model and perform evaluation.
- Parameters:
- `train_epoch`: int, optional. Number of training epochs.
- `eval_freq`: int, optional. Frequency of evaluation during training.
- `verbose_freq`: int, optional. Frequency of verbose logging.
- Return:
- `model`: torch.nn.Module, trained node classification model.
- `print_best_results`: print the evaluation results of the best model.
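Based on the documented parameters, a minimal trainer invocation might look as follows (keyword names follow the docstring above; the specific values and the `model`/`data` objects are our assumptions):

```python
import torch

from toba import TopoBalanceAugmenter
from trainer import NodeClassificationTrainer

# Assemble the trainer with the documented arguments; `model` and `data`
# are a GNN and a PyG data object prepared as in the earlier examples.
trainer = NodeClassificationTrainer(
    model=model,
    data=data,
    device="cuda" if torch.cuda.is_available() else "cpu",
    augmenter=TopoBalanceAugmenter(mode="topo", random_state=42),
    learning_rate=0.01,
    weight_decay=5e-4,
    train_epoch=2000,
    early_stop_patience=200,
)
best_model = trainer.train()      # returns the trained model
trainer.print_best_results()      # prints the best model's evaluation results
```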
## Empirical Results
### Experimental Setup
To fully validate **ToBA**'s performance and compatibility with existing imbalanced graph learning (IGL) techniques and GNN backbones, we test 6 baseline methods with 5 popular GNN backbone architectures in our experiments, and apply ToBA together with each of them under all possible combinations (the imbalance construction is sketched after the lists below):
- **Datasets**: Cora, Citeseer, Pubmed, CS, Physics
- **Imbalance-handling techniques**:
- Reweighting [1]
- ReNode [2]
- Oversample [3]
- SMOTE [4]
- GraphSMOTE [5]
- GraphENS [6]
- **GNN backbones**:
- GCN [7]
- GAT [8]
- SAGE [9]
- APPNP [10]
- GPRGNN [11]
- **Imbalance types & ratios**:
- **Step imbalance**: 10:1, 20:1
- **Natural imbalance**: 50:1, 100:1
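To make these settings concrete: under step imbalance, half of the classes keep their full training quota while the other half are downsampled by the imbalance ratio (IR); natural imbalance instead follows a long-tailed class distribution. The sketch below reconstructs a step-imbalanced training mask under that common convention (the repository's actual sampler may differ in details):

```python
import torch

def step_imbalance_mask(y, train_mask, imb_ratio=10, n_per_class=20, seed=42):
    """Downsample the latter half of classes to n_per_class / imb_ratio labeled
    nodes each (a common step-imbalance construction; details may differ here)."""
    g = torch.Generator().manual_seed(seed)
    n_class = int(y.max()) + 1
    new_mask = torch.zeros_like(train_mask)
    for c in range(n_class):
        idx = torch.where(train_mask & (y == c))[0]
        idx = idx[torch.randperm(len(idx), generator=g)]
        # first half of the classes are majority, the rest are minority
        quota = n_per_class if c < n_class // 2 else max(n_per_class // imb_ratio, 1)
        new_mask[idx[:quota]] = True
    return new_mask
```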
### On the effectiveness and versatility of TOBA
We first report the detailed empirical results of applying **ToBA** with 6 IGL baselines and 5 GNN backbones on 3 imbalanced graphs (Cora, CiteSeer, and PubMed) with IR=10 in Table 1. Across all 3 (datasets) x 5 (backbones) x 7 (baselines) x 2 (ToBA variants) x 3 (metrics) = **630 setting combinations**, ToBA achieves significant and consistent performance improvements on top of the other IGL techniques, yielding new state-of-the-art performance. In addition to boosting classification, **ToBA** also greatly reduces the model's predictive bias.

### On the robustness of TOBA
We now test **ToBA**'s robustness to varying types of extreme class imbalance. In this experiment, we extend the main results (Table 1) to a more challenging scenario with IR = 20. In addition, we consider the natural (long-tail) class imbalance commonly observed in real-world graphs, with IRs of 50 and 100. The larger CS and Physics datasets are also included to test **ToBA**'s performance on large-scale tasks. Results show that **ToBA** consistently demonstrates superior performance in boosting classification and reducing predictive bias.

### On mitigating AMP and DMP
We further design experiments to verify whether **ToBA** can effectively handle the topological challenges identified in this paper, i.e., ambivalent and distant message-passing (AMP and DMP). Specifically, we investigate whether **ToBA** improves the prediction accuracy of minority-class nodes that are highly influenced by ambivalent/distant message-passing, i.e., nodes with high local heterophilic ratios or long distances to supervision signals. Results are shown in the figure below (5 independent runs with a GCN classifier, IR=10). As can be observed, **ToBA** effectively alleviates the negative impact of AMP and DMP and helps node classifiers achieve better performance on minority classes.
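For reference, a node's local heterophilic ratio is the fraction of its neighbors whose label differs from its own. The sketch below computes it for analysis purposes (this is the standard definition; the helper itself is ours, not part of the repository):

```python
import torch

def local_heterophilic_ratio(edge_index, y):
    """Fraction of each node's neighbors carrying a different label."""
    src, dst = edge_index
    differs = (y[src] != y[dst]).float()
    hetero = torch.zeros(y.size(0)).scatter_add_(0, dst, differs)
    degree = torch.zeros(y.size(0)).scatter_add_(0, dst, torch.ones_like(differs))
    return hetero / degree.clamp(min=1)  # isolated nodes get ratio 0
```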

## References
| # | Reference |
| ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [1] | Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002. |
| [2] | Deli Chen, Yankai Lin, Guangxiang Zhao, Xuancheng Ren, Peng Li, Jie Zhou, and Xu Sun. Topology-imbalance learning for semi-supervised node classification. Advances in Neural Information Processing Systems, 34:29885–29897, 2021. |
| [3] | Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002. |
| [4] | Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002. |
| [5] | Tianxiang Zhao, Xiang Zhang, and Suhang Wang. Graphsmote: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM international conference on web search and data mining, pages 833–841, 2021. |
| [6] | Joonhyung Park, Jaeyun Song, and Eunho Yang. Graphens: Neighbor-aware ego network synthesis for class-imbalanced node classification. In International Conference on Learning Representations, 2022. |
| [7] | Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. |
| [8] | Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. |
| [9] | Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. |
| [10] | Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018. |
| [11] | Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. arXiv preprint arXiv:2006.07988, 2020. |