# Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

**UPDATE**: The pt file for Citeseer has some problems. Please use the latest version `citeseer2` instead of the version inside `small_data.zip`. We use [GraphCleaner](https://github.com/lywww/GraphCleaner/tree/master/case_studies) to fix wrong labels.

This is the official code repository for our paper [Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs](https://arxiv.org/abs/2307.03393).

We are developing a library, [LLM4Graph](https://github.com/CurryTang/LLM4Graph), to support more LLM-GNN models and more node-level, link-level, and graph-level TAG datasets.

## Introduction

Learning on graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes primarily relies on Graph Neural Networks (GNNs) and utilizes shallow text embeddings as initial node representations, which have limitations in general knowledge and profound semantic understanding. In recent years, Large Language Models (LLMs) have been proven to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows for handling text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generates predictions through GNNs. The latter attempts to employ LLMs directly as standalone predictors. We conduct comprehensive and systematic studies of these two pipelines under various settings. From comprehensive empirical results, we make original observations and find new insights that open new possibilities and suggest promising directions for leveraging LLMs for learning on graphs.

We provide the implementation of the following pipelines.

### LLMs-as-Predictors

Check `ego_graph.py`, which directly uses ChatGPT to make zero-shot/few-shot predictions.

![LLMs-as-Predictors](https://github.com/CurryTang/Graph-LLM/blob/master/imgs/llm_as_predictor.png)

### LLMs-as-Enhancers

Check `baseline.py`; various kinds of embedding-visible LLMs (such as LLaMA, SentenceBERT, or text-embedding-ada-002) can be used to generate embeddings as node features. A minimal sketch of this pipeline appears after the citation below.

![LLMs-as-Enhancers](https://github.com/CurryTang/Graph-LLM/blob/master/imgs/llm_as_enhancer.png)

### (New project) LLMs-as-Annotators

Check out our new project here: [Label-free Node Classification on Graphs with Large Language Models (LLMs)](https://github.com/CurryTang/LLMGNN)

## Citation

```
@article{Chen2023ExploringTP,
  title={Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs},
  author={Zhikai Chen and Haitao Mao and Hang Li and Wei Jin and Haifang Wen and Xiaochi Wei and Shuaiqiang Wang and Dawei Yin and Wenqi Fan and Hui Liu and Jiliang Tang},
  journal={ArXiv},
  year={2023},
  volume={abs/2307.03393}
}
```
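To make the LLMs-as-Enhancers pipeline concrete before the setup instructions, here is a minimal, self-contained sketch (not the repository's `baseline.py`): encode each node's text with an embedding-visible LLM, then feed the embeddings to a GNN. The encoder name and the two-layer GCN are illustrative choices, not the exact configuration used in the paper.

```python
# Illustrative sketch of LLMs-as-Enhancers (not the repository's baseline.py):
# encode node texts with a sentence encoder, then train a GCN on the features.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from torch_geometric.nn import GCNConv

texts = ["Paper title and abstract ...", "Another paper ..."]  # one string per node
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding-visible LLM works here
x = torch.tensor(encoder.encode(texts))            # [num_nodes, dim] node features
edge_index = torch.tensor([[0, 1], [1, 0]])        # toy 2-node graph, edges 0->1 and 1->0

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = GCN(x.size(1), 64, num_classes=2)
logits = model(x, edge_index)  # downstream: cross-entropy loss on labeled nodes
```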
## 0. Environment Setup

### Package Installation

Assuming your CUDA version is 11.8:

```
conda create --name LLMGNN python=3.10
conda activate LLMGNN
conda install pytorch==2.0.0 cudatoolkit=11.8 -c pytorch
conda install -c pyg pytorch-sparse
conda install -c pyg pytorch-scatter
conda install -c pyg pytorch-cluster
conda install -c pyg pyg
pip install ogb
conda install -c dglteam/label/cu118 dgl
pip install transformers
pip install --upgrade accelerate
pip install openai
pip install langchain
pip install gensim
pip install google-generativeai
pip install -U sentence-transformers
pip install editdistance
pip install InstructorEmbedding
pip install optuna
pip install tiktoken
pip install pytorch_warmup
```

### Dataset

We provide the processed datasets via the following [Google Drive link](https://drive.google.com/drive/folders/1_laNA6eSQ6M5td2LvsEp3IL9qF6BC1KV?usp=sharing).

To set up the files, you need to:

1. Unzip `small_data.zip` into `preprocessed_data/new`.
2. If you want to use ogbn-products, unzip `big_data.zip` into `preprocessed_data/new`.
3. Download and move `*_explanation.pt` and `*_pl.pt` into `preprocessed_data/new`. These files are related to TAPE.
4. Unzip `ada.zip` into `./`.
5. Move `*_entity.pt` into `./`.
6. Put `ogb_arxiv.csv` into `./preprocessed_data`.

### Get ft and no-ft LM embeddings

Refer to the following script:

```bash
for setting in "random"
do
    for data in "cora" "pubmed"
    do
        WANDB_DISABLED=True CUDA_VISIBLE_DEVICES=3 python3 lmfinetune.py --dataset $data --split $setting --batch_size=9 --label_smoothing 0.3 --seed_num 5
        WANDB_DISABLED=True CUDA_VISIBLE_DEVICES=3 python3 lmfinetune.py --dataset $data --split $setting --batch_size=9 --label_smoothing 0.3 --seed_num 5 --use_explanation 1
    done
done
```

### Generate pt files for all data formats

Run:

```bash
python3 generate_pyg_data.py
```
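Once the pt files are generated, the processed graphs can be inspected directly. The snippet below is a hedged sketch: the filename follows the `<dataset>_<split>_<format>.pt` pattern that appears in the RevGAT commands later in this README, and the attribute names (`x`, `edge_index`, `y`) are assumptions based on standard `torch_geometric` conventions, not a verified schema of every file.

```python
# Hedged sketch: peek at a processed graph. Attribute names assume a
# standard torch_geometric Data object; adjust to the actual file contents.
import torch

data = torch.load("preprocessed_data/new/arxiv_fixed_sbert.pt")
print(data)                                 # e.g. Data(x=[N, d], edge_index=[2, E], y=[N], ...)
print(data.x.shape, data.edge_index.shape)  # node features and graph connectivity
```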
## 1. Experiments for **LLMs-as-Enhancers**

For feature-level **LLMs-as-Enhancers**, you can replicate the experiments using **baseline.py** and **lmfinetune.py**.

For example, you can run a parameter sweep with the following script:

```bash
for model in "GCN" "GAT" "MLP"
do
    for data in "cora" "pubmed"
    do
        for setting in "random"
        do
            # Add more formats here
            for format in "ft"
            do
                CUDA_VISIBLE_DEVICES=1 python3 baseline.py --model_name $model --seed_num 5 --sweep_round 40 --mode sweep --dataset $data --split $setting --data_format $format
                echo "$model $data $setting $format done"
            done
        done
    done
done
```

Run with a specific group of hyperparameters:

```bash
python3 baseline.py --data_format sbert --split random --dataset pubmed --lr 0.01 --seed_num 5
```

For feature ensembles, separate the ensemble formats with `\;`:

```bash
CUDA_VISIBLE_DEVICES=1 python3 baseline.py --model_name GCN --num_split 1 --seed_num 5 --sweep_split 1 --sweep_round 5 --mode sweep --dataset pubmed --split random --ensemble_string sbert\;know_sep_sb\;ft\;pl\;know_exp_ft
```

Batch version for ogbn-products:

```bash
CUDA_VISIBLE_DEVICES=7 python3 baseline.py --model_name SAGE --epochs 10 --num_split 1 --batchify 1 --dataset products --split fixed --data_format ft --normalize 1 --norm BatchNorm --mode main --lr 0.003 --dropout 0.5 --weight_decay 0 --hidden_dimension 256 --num_layers 3
```

To replicate the results for RevGAT (you first need to run once with the default features to generate the DGL data):

```bash
python dgl_main.py --data_root_dir ./dgldata \
    --pretrain_path ./preprocessed_data/new/arxiv_fixed_sbert.pt \
    --use-norm --use-labels --n-label-iters=1 --no-attn-dst --edge-drop=0.3 --input-drop=0.25 --n-layers 2 --dropout 0.75 --n-hidden 256 --save kd --backbone rev --group 2 --mode teacher

python dgl_main.py --data_root_dir ./dgldata \
    --pretrain_path ./preprocessed_data/new/arxiv_fixed_sbert.pt \
    --use-norm --use-labels --n-label-iters=1 --no-attn-dst --edge-drop=0.3 --input-drop=0.25 --n-layers 2 --dropout 0.75 --n-hidden 256 --save kd --backbone rev --group 2 --mode student --alpha 0.95 --temp 0.7
```

To replicate the results for [SAGN](https://github.com/THUDM/SCR/tree/main/ogbn-products) and [GLEM](https://github.com/AndyJZhao/GLEM), you can check their repositories and feed the processed pt files into their pipelines.

## 2. Experiments for **LLMs-as-Predictors**

Just run:

```bash
python3 ego_graph.py
```

A hypothetical sketch of this style of zero-shot prompting appears at the end of this README.

## 3. (UPDATE) Further Experiments on OOD & Prompts

In two recent studies, [Can LLMs Effectively Leverage Graph Structural Information: When and Why](https://arxiv.org/pdf/2309.16595.pdf) and [Explanations as Features: LLM-Based Features for Text-Attributed Graphs](https://arxiv.org/pdf/2305.19523), researchers probed a specific prompt tailored for an Arxiv dataset containing post-2023 data, which ChatGPT's pre-training corpus doesn't cover. Notably, the results showed no decline in performance compared to the original dataset. This intriguing outcome prompts us to delve deeper into creating effective prompts across varied domains.

Out-of-distribution (OOD) generalization, commonly known as graph OOD in this setting, is an active area of discussion. Recent benchmarks, such as [GOOD](https://github.com/divelab/GOOD/tree/GOODv1/GOOD), indicate that GNNs don't fare well under structural and feature shifts.
We ran an experiment on the Arxiv dataset to assess the potential of LLMs-as-Predictors under OOD shifts, leveraging a prompt from [Explanations as Features: LLM-Based Features for Text-Attributed Graphs](https://arxiv.org/pdf/2305.19523) that exhibited superior performance.

|                  | All avg      | Val   | Test  | Best baseline (test) |
|------------------|--------------|-------|-------|----------------------|
| concept degree   | 73.91 ± 0.63 | 73.01 | 72.79 | 63.00                |
| covariate degree | 75.75 ± 3.6  | 70.23 | 68.21 | 59.08                |
| concept time     | 74.29 ± 0.96 | 72.66 | 71.98 | 67.45                |
| covariate time   | 72.69 ± 1.53 | 74.28 | 74.37 | 71.34                |

* **Concept shift**: P(Y|X) varies, while the split construction remains anchored to the covariate shift by adjusting the ratios in each domain.
* **Covariate shift**: P(X) shifts, while P(Y|X) remains consistent.

For the covariate shift, the splits use 10/1/1 environments (train/val/test); for the concept shift, they use 3/1/1 (train/val/test). `All avg` denotes the mean performance across all environments. One discernible merit of using LLMs-as-Predictors is their heightened resilience to OOD shifts.
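To make the prompting setup concrete, here is a hypothetical sketch of a zero-shot classification call in the spirit of `ego_graph.py` and the TAPE-style prompts discussed above. The prompt wording, label subset, and model choice are illustrative assumptions, not the exact ones used in the paper; the call uses the legacy (pre-1.0) `openai` interface installed in the setup section.

```python
# Hypothetical zero-shot node-classification prompt (illustrative, not the
# exact prompt from ego_graph.py or the cited papers). Legacy openai<1.0 API.
import openai

openai.api_key = "YOUR_API_KEY"

title = "Attention Is All You Need"
abstract = "The dominant sequence transduction models are based on ..."
labels = ["cs.LG", "cs.CL", "cs.CV"]  # illustrative subset of arXiv categories

prompt = (
    f"Paper title: {title}\n"
    f"Abstract: {abstract}\n"
    f"Which arXiv category does this paper belong to? "
    f"Choose one of: {', '.join(labels)}. Answer with the category only."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output for evaluation
)
print(response["choices"][0]["message"]["content"])
```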