# TAFS

**Repository Path**: spring996/TAFS

## Basic Information

- **Project Name**: TAFS
- **Description**: Topology-Aware Functional Similarity: Integrating Extended Neighborhoods via Exponential Attenuation
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-12-31
- **Last Updated**: 2025-08-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TAFS Protein Function Prediction Model

## Overview

TAFS (Topology-Aware Functional Similarity) is a protein function prediction model that uses the topology of protein-protein interaction (PPI) networks to infer functional relationships between proteins. This implementation provides a framework for predicting protein functions from PPI network topology.

## Key Features

- **Topology-based similarity calculation**: Measures functional relevance using network distance metrics
- **Efficient computation**: Implements the similarity calculation with matrix operations
- **Adjustable parameters**: Exposes the distance weight decay factor (gamma); the default value is 0.15

## Core Runtime Environment

```markdown
numpy: 1.26.4
pandas: 2.2.3
networkx: 3.3
loguru: 0.7.2
```

## Data Preparation

The protein function annotation data integration process consists of six steps:

1. GAF Data Initialization: Extracts experimentally validated protein function annotations from Gene Ontology Annotation Files (GAF), keeping only six high-confidence evidence codes (including EXP and IDA) and filtering by GO level.
2. STRING Network Processing: Loads PPI networks, filters out low-confidence interactions (default threshold 500), and constructs a graph containing all STRING protein nodes.
3. ID Mapping Conversion: Maps gene IDs from GAF and protein IDs from STRING to standard UniProt IDs using mapping tables, and takes their intersection as the core dataset.
4. Domain Information Integration: Extracts conserved domain features of the shared proteins from InterProScan results to build a protein-to-domain mapping dictionary.
5. Network Reconstruction: Rebuilds the PPI network from the shared UniProt IDs, retaining only the nodes present in both datasets and the edges between them to ensure data consistency.
6. Annotation Data Filtering: Maps UniProt IDs back to the original IDs, keeps only the function annotations of proteins that appear in the network, and generates the integrated dataset containing the protein list, PPI network, domain dictionary, and GO annotations.

Processed data is stored in the `TAFS_Data` class with the following fields:

```python
import networkx as nx
import pandas as pd


class TAFS_Data:
    # Dataset name
    dataset_name: str

    # Protein ID list, maintaining protein name order
    protein_IDs: list[str]

    # PPI network graph (NetworkX format)
    PPI_g: nx.Graph

    # Protein domain dictionary {Protein ID: domain set}
    IDs_domain_dict: dict[str, set]

    # GO annotation dataframe containing columns:
    # - 'DB_Object_ID': Protein ID
    # - 'GO_ID': GO term ID
    # - 'Evidence': Evidence code
    # - 'GO_level': GO level
    # - 'Namespace': Functional classification (F/P/C)
    GO_df: pd.DataFrame
```

Demo data is available at: https://gitee.com/spring996/TAFS/tree/master/Data

## Model Usage

### Initialization

```python
# Prepare input data
data = TAFS_Data()
data.dataset_name = "Example Dataset"
data.protein_IDs = ["P1", "P2", "P3"]
data.PPI_g = nx.Graph()  # Add actual PPI network edges
data.IDs_domain_dict = {"P1": {"D1", "D2"}, "P2": {"D2"}}
data.GO_df = pd.DataFrame(...)  # Populate GO annotation data

# Initialize TAFS model
gamma_dict = {"gamma_k": 0.15}  # Distance weight decay parameter
model = TAFS(
    protein_name_list=data.protein_IDs,
    g_ppi=data.PPI_g,
    gamma_dict=gamma_dict,
    prf_type="FS"  # Prediction method
)
```

### Model Training

```python
target_proteins = ("P1", "P2")  # Proteins to predict
func_names = tuple(data.GO_df['GO_ID'].unique())  # Functional terms
ref_protein_funcs_numpy = ...  # Generate reference matrix from GO_df

model.train(
    target_proteins=target_proteins,
    func_names=func_names,
    ref_protein_funcs_numpy=ref_protein_funcs_numpy
)
```

### Function Prediction

```python
# Get thresholded prediction results as a NumPy array
predictions_numpy = model.predFuncsThreshold_numpy(
    target_protein_names=target_proteins,
    Threshold=0.5
)

# Get thresholded prediction results as a dictionary
predictions_dict = model.predFuncsThreshold_dict(
    target_protein_names=target_proteins,
    Threshold=0.5
)
```

## Parameter Description

- `gamma_k`: Distance weight decay factor (default 0.15)
  - Higher values emphasize direct neighbors
  - Lower values give more weight to distant connections
- `prf_type`: Prediction method (currently only "FS" is supported)

## Performance Notes

- The model precomputes shortest paths between all node pairs so that similarity calculations do not repeat graph searches
- Matrix operations are implemented with NumPy
- Large networks may require significant memory, since the similarity matrix grows with the square of the number of proteins

## References

Peng. W. Topology-Aware Functional Similarity: Integrating Extended Neighborhoods via Exponential Attenuation

## License

MIT License

## Contact Information

For questions, please contact: Peng Wang pengw@ctbu.edu.cn
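
## Example Sketch: Building the PPI Graph from STRING Links

The STRING processing step in Data Preparation (drop low-confidence interactions but keep every STRING protein as a node) can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `build_ppi_graph` is hypothetical, the column names `protein1`, `protein2`, and `combined_score` follow the usual STRING links file layout, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import networkx as nx
import pandas as pd


def build_ppi_graph(links: pd.DataFrame, threshold: int = 500) -> nx.Graph:
    """Hypothetical helper: build a PPI graph from a STRING-style links table."""
    g = nx.Graph()
    # Keep every protein as a node, even if all of its edges are filtered out.
    g.add_nodes_from(set(links["protein1"]) | set(links["protein2"]))
    # Assumption: an interaction is kept when its score is >= threshold.
    kept = links[links["combined_score"] >= threshold]
    g.add_edges_from(zip(kept["protein1"], kept["protein2"]))
    return g


# Toy data for illustration only.
links = pd.DataFrame(
    {
        "protein1": ["A", "B"],
        "protein2": ["B", "C"],
        "combined_score": [900, 200],
    }
)
g = build_ppi_graph(links, threshold=500)
```

In this toy table the low-scoring B-C interaction is dropped, while protein C remains in the graph as an isolated node.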
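
## Example Sketch: A Reference Matrix from GO Annotations

The training example leaves the construction of `ref_protein_funcs_numpy` open. One plausible layout, shown here purely as an assumption, is a binary matrix with one row per protein (in `protein_IDs` order) and one column per entry of `func_names`. The helper name `build_reference_matrix` is hypothetical, and the shape and dtype that `model.train` actually expects should be checked against the repository.

```python
import numpy as np
import pandas as pd


def build_reference_matrix(go_df, protein_ids, func_names):
    """Hypothetical helper: binary protein x GO-term annotation matrix."""
    row = {p: i for i, p in enumerate(protein_ids)}
    col = {f: j for j, f in enumerate(func_names)}
    mat = np.zeros((len(protein_ids), len(func_names)), dtype=np.int8)
    for pid, go_id in zip(go_df["DB_Object_ID"], go_df["GO_ID"]):
        # Annotations for proteins or terms outside the index are ignored.
        if pid in row and go_id in col:
            mat[row[pid], col[go_id]] = 1
    return mat


# Toy annotations for illustration only.
go_df = pd.DataFrame(
    {
        "DB_Object_ID": ["P1", "P1", "P3"],
        "GO_ID": ["GO:0003674", "GO:0008150", "GO:0008150"],
    }
)
protein_ids = ["P1", "P2", "P3"]
func_names = tuple(go_df["GO_ID"].unique())
ref = build_reference_matrix(go_df, protein_ids, func_names)
```

Proteins without annotations, such as `P2` here, simply get an all-zero row.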
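
## Example Sketch: Distance-Attenuated Scoring

To make the role of `gamma_k` concrete, the sketch below weights each pair of proteins by `exp(-gamma * d)`, where `d` is their shortest-path distance, and scores each function as a weighted vote over annotated proteins. The exact TAFS formula is not given in this README, so the weighting function, the row normalization, and the names `attenuated_weights` and `score_functions` are all assumptions, chosen only to be consistent with the statement that higher gamma values emphasize direct neighbors.

```python
import math

import networkx as nx
import numpy as np


def attenuated_weights(g, proteins, gamma=0.15, max_dist=None):
    """Assumed weighting: W[i, j] = exp(-gamma * d(i, j)) for i != j.

    The diagonal and unreachable pairs keep weight 0.
    """
    index = {p: i for i, p in enumerate(proteins)}
    w = np.zeros((len(proteins), len(proteins)))
    for src, lengths in nx.all_pairs_shortest_path_length(g, cutoff=max_dist):
        if src not in index:
            continue
        for dst, d in lengths.items():
            if d > 0 and dst in index:
                w[index[src], index[dst]] = math.exp(-gamma * d)
    return w


def score_functions(w, ref_funcs):
    """Weighted vote, normalized by each protein's total weight."""
    totals = w.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # isolated proteins keep an all-zero score row
    return (w @ ref_funcs) / totals


# Toy path graph P1 - P2 - P3; only P3 carries the single function.
proteins = ["P1", "P2", "P3"]
g = nx.Graph([("P1", "P2"), ("P2", "P3")])
ref_funcs = np.array([[0.0], [0.0], [1.0]])

w = attenuated_weights(g, proteins, gamma=0.15)
scores = score_functions(w, ref_funcs)
```

In this toy graph `P2` is adjacent to the annotated protein and scores higher than `P1`, which is two steps away. Increasing `gamma` widens that gap, which matches the parameter description above. Note that the dense weight matrix is the memory cost mentioned in the Performance Notes.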