# TAFS

**Repository Path**: spring996/TAFS

## Basic Information

- **Project Name**: TAFS
- **Description**: Topology-Aware Functional Similarity: Integrating Extended Neighborhoods via Exponential Attenuation
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-12-31
- **Last Updated**: 2025-08-12

## README

# TAFS Protein Function Prediction Model

## Overview

TAFS (Topology-Aware Functional Similarity) is a protein function prediction model that utilizes the topological structure of protein-protein interaction (PPI) networks to infer functional relationships between proteins. This implementation provides a framework for predicting protein functions based on PPI network topological information.

## Key Features

- **Topology-based similarity calculation**: Measures functional relevance using network distance metrics
- **Efficient computation**: Optimizes performance through matrix operations
- **Adjustable parameters**: Exposes the distance weight decay factor (`gamma_k`), which defaults to 0.15

## Core Runtime Environment

```text
numpy: 1.26.4
pandas: 2.2.3
networkx: 3.3
loguru: 0.7.2
```

## Data Preparation

The protein function annotation data integration process consists of six key steps:

1. GAF Data Initialization: Extracts experimentally validated protein function annotations from Gene Ontology Annotation Files (GAF), keeping only six high-confidence evidence codes (including EXP and IDA) and filtering annotations by GO level.
2. STRING Network Processing: Loads PPI networks, filters low-confidence interactions (default threshold 500), and constructs graph structures containing all STRING protein nodes.
3. ID Mapping Conversion: Unifies gene IDs from GAF and protein IDs from STRING into standard UniProt IDs using mapping tables, calculating their intersection as the core dataset.
4. Domain Information Integration: Extracts conserved domain features of common proteins from InterProScan results to establish protein-domain mapping dictionaries.
5. Network Reconstruction: Rebuilds PPI networks based on shared UniProt IDs, retaining only nodes and corresponding edges present in both datasets to ensure data consistency.
6. Annotation Data Filtering: Reverse maps UniProt IDs to original IDs, filters final protein function annotations participating in the network, and generates integrated datasets containing protein lists, PPI networks, domain dictionaries, and GO annotations.
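
Steps 2 and 5 can be illustrated with a minimal sketch. It is not the repository's preprocessing code: the function names are hypothetical, the column names `protein1`, `protein2`, and `combined_score` are assumed to follow the STRING links file, and "threshold 500" is assumed to mean "keep scores of at least 500".

```python
import networkx as nx
import pandas as pd


def build_ppi_graph(edges: pd.DataFrame, min_score: int = 500) -> nx.Graph:
    """Build an undirected PPI graph, dropping low-confidence interactions."""
    kept = edges[edges["combined_score"] >= min_score]  # assumed: inclusive cut-off
    g = nx.Graph()
    # Register every protein first so nodes survive even if all their edges are dropped
    g.add_nodes_from(pd.unique(edges[["protein1", "protein2"]].values.ravel()))
    g.add_edges_from(zip(kept["protein1"], kept["protein2"]))
    return g


def restrict_to_shared(g: nx.Graph, shared_ids: set[str]) -> nx.Graph:
    """Keep only the proteins present in both datasets and the edges between them."""
    return g.subgraph(shared_ids).copy()


# Toy data standing in for a STRING links table
edges = pd.DataFrame({
    "protein1": ["A", "A", "B", "C"],
    "protein2": ["B", "C", "C", "D"],
    "combined_score": [900, 300, 650, 700],
})
g = build_ppi_graph(edges)
core = restrict_to_shared(g, {"A", "B", "C"})
```

In this toy example the A–C interaction (score 300) is dropped by the filter, and protein D disappears once the graph is restricted to the shared ID set.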

Processed data is stored in the `TAFS_Data` class structure with the following fields:

```python
class TAFS_Data:
    # Dataset name
    dataset_name: str
    
    # Protein ID list, maintaining protein name order
    protein_IDs: list[str]
    
    # PPI network graph (NetworkX format)
    PPI_g: nx.Graph
    
    # Protein domain dictionary {Protein ID: domain set}
    IDs_domain_dict: dict[str, set]
    
    # GO annotation dataframe containing columns:
    # - 'DB_Object_ID': Protein ID
    # - 'GO_ID': GO number
    # - 'Evidence': Evidence code
    # - 'GO_level': GO level
    # - 'Namespace': Functional classification (F/P/C)
    GO_df: pd.DataFrame
```

Demo data is available at:

https://gitee.com/spring996/TAFS/tree/master/Data

## Model Usage

### Initialization

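```python
import networkx as nx
import pandas as pd

# TAFS and TAFS_Data are defined in this repository; import them
# according to where the source files sit in your project.

# Prepare input data
data = TAFS_Data()
data.dataset_name = "Example Dataset"
data.protein_IDs = ["P1", "P2", "P3"]
data.PPI_g = nx.Graph()  # Add actual PPI network edges
data.IDs_domain_dict = {"P1": {"D1", "D2"}, "P2": {"D2"}}
data.GO_df = pd.DataFrame(...)  # Populate GO annotation data
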
# Initialize TAFS model
gamma_dict = {"gamma_k": 0.15}  # Distance weight decay parameter
model = TAFS(
    protein_name_list=data.protein_IDs,
    g_ppi=data.PPI_g,
    gamma_dict=gamma_dict,
    prf_type="FS"  # Prediction method
)
```

### Model Training

```python
target_proteins = ("P1", "P2")  # Proteins to predict
func_names = tuple(data.GO_df['GO_ID'].unique())  # Functional terms
ref_protein_funcs_numpy = ...  # Generate reference matrix from GO_df

model.train(
    target_proteins=target_proteins,
    func_names=func_names,
    ref_protein_funcs_numpy=ref_protein_funcs_numpy
)
```
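
The snippet above leaves `ref_protein_funcs_numpy` as a placeholder. One plausible way to build it is sketched here, assuming the model expects a binary matrix with one row per protein in `protein_IDs` and one column per entry in `func_names`. That layout is an assumption, not something this README confirms, and `build_reference_matrix` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def build_reference_matrix(go_df, protein_ids, func_names):
    """Return a (proteins x functions) 0/1 matrix from GO annotations."""
    row = {p: i for i, p in enumerate(protein_ids)}
    col = {f: j for j, f in enumerate(func_names)}
    mat = np.zeros((len(protein_ids), len(func_names)), dtype=np.int8)
    for pid, go_id in zip(go_df["DB_Object_ID"], go_df["GO_ID"]):
        # Skip annotations for proteins or terms outside the chosen index
        if pid in row and go_id in col:
            mat[row[pid], col[go_id]] = 1
    return mat


# Toy annotations using the GO_df column names documented above
go_df = pd.DataFrame({
    "DB_Object_ID": ["P1", "P1", "P2"],
    "GO_ID": ["GO:1", "GO:2", "GO:2"],
})
ref = build_reference_matrix(go_df, ["P1", "P2", "P3"], ("GO:1", "GO:2"))
```

Proteins without annotations (here `P3`) end up as all-zero rows, which keeps the row order aligned with `protein_IDs`.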

### Function Prediction

```python
# Get thresholded numpy array prediction results
predictions_numpy = model.predFuncsThreshold_numpy(
    target_protein_names=target_proteins,
    Threshold=0.5
)

# Get dictionary-form prediction results
predictions_dict = model.predFuncsThreshold_dict(
    target_protein_names=target_proteins,
    Threshold=0.5
)
```
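
How the two methods derive their output is not documented here. As a rough illustration only, the sketch below assumes each protein-function pair receives a score in [0, 1] and that a function is predicted when its score is at least `Threshold`. The helper `threshold_scores` is hypothetical and not part of the TAFS API.

```python
import numpy as np


def threshold_scores(scores, protein_names, func_names, threshold=0.5):
    """Turn a score matrix into a 0/1 array and a {protein: [functions]} dict."""
    mask = np.asarray(scores) >= threshold  # assumed: inclusive cut-off
    as_dict = {
        p: [f for f, keep in zip(func_names, row) if keep]
        for p, row in zip(protein_names, mask)
    }
    return mask.astype(int), as_dict


scores = [[0.9, 0.2],
          [0.5, 0.4]]
pred_array, pred_dict = threshold_scores(scores, ("P1", "P2"), ("GO:1", "GO:2"))
```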

## Parameter Description

- `gamma_k`: Distance weight decay factor (default 0.15)
    - Higher values emphasize direct neighbors
    - Lower values give more weight to distant connections
- `prf_type`: Prediction method (currently only `"FS"` is supported)
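
The effect of `gamma_k` can be made concrete with a small sketch. It assumes the attenuation takes the form `exp(-gamma_k * d)` for shortest-path distance `d`, which matches the behaviour described above (larger values favour direct neighbors) but is not confirmed to be the exact formula TAFS uses.

```python
import numpy as np


def attenuation_weights(distances, gamma_k=0.15):
    """Weight of a neighbor at each distance under the assumed exponential decay."""
    d = np.asarray(distances, dtype=float)
    return np.exp(-gamma_k * d)


hops = [1, 2, 3, 4]
slow = attenuation_weights(hops, gamma_k=0.15)  # gentle decay
fast = attenuation_weights(hops, gamma_k=1.0)   # steep decay

# Share of the total weight held by direct neighbors (d = 1)
slow_direct_share = slow[0] / slow.sum()
fast_direct_share = fast[0] / fast.sum()
```

Under this assumed form, `fast_direct_share` exceeds `slow_direct_share`: a larger `gamma_k` concentrates the weight on direct neighbors, while a smaller one spreads it over the extended neighborhood.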

## Performance Notes

- The model precomputes shortest paths between all node pairs to improve similarity calculation efficiency
- Uses NumPy for optimized matrix operations
- Large networks may require significant memory for similarity matrix storage
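
To give a feel for the cost of the precomputation, the following sketch builds a dense distance matrix with NetworkX and estimates its memory footprint. It is not the model's internal code; `distance_matrix` and `dense_matrix_gib` are hypothetical helpers, and 8-byte floats are assumed.

```python
import networkx as nx
import numpy as np


def distance_matrix(g: nx.Graph, node_order: list[str]) -> np.ndarray:
    """Dense all-pairs hop distances; unreachable pairs stay at infinity."""
    index = {n: i for i, n in enumerate(node_order)}
    dist = np.full((len(node_order), len(node_order)), np.inf)
    for src, lengths in nx.all_pairs_shortest_path_length(g):
        if src not in index:
            continue
        for dst, d in lengths.items():
            if dst in index:
                dist[index[src], index[dst]] = d
    return dist


def dense_matrix_gib(n_nodes: int, itemsize: int = 8) -> float:
    """Memory needed for one dense n x n matrix, in GiB."""
    return n_nodes ** 2 * itemsize / 2 ** 30


g = nx.Graph([("P1", "P2"), ("P2", "P3")])
g.add_node("P4")  # isolated protein
dist = distance_matrix(g, ["P1", "P2", "P3", "P4"])
```

By this estimate, a single dense matrix for a 20,000-protein network needs roughly 3 GiB (`dense_matrix_gib(20_000)`), so memory grows quadratically with network size.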

## References

Peng, W. Topology-Aware Functional Similarity: Integrating Extended Neighborhoods via Exponential Attenuation.

## License

MIT License

## Contact Information

For questions, please contact:

Peng Wang

pengw@ctbu.edu.cn