# uniprot **Repository Path**: hf-datasets/uniprot ## Basic Information - **Project Name**: uniprot - **Description**: Mirror of https://huggingface.co/datasets/damlab/uniprot - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 1 - **Created**: 2023-10-30 - **Last Updated**: 2024-08-23 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README --- liscence: mit --- # Dataset Description ## Dataset Summary This dataset is a mirror of the Uniprot/SwissProt database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz. Supported Tasks and Leaderboards: None Languages: English ## Dataset Structure ### Data Instances Data Fields: id, description, sequence Data Splits: None ## Dataset Creation The dataset was downloaded and parsed into a `dataset` object and uploaded unchanged. Initial Data Collection and Normalization: Dataset was downloaded and curated on 03/09/2022. ## Considerations for Using the Data Social Impact of Dataset: Due to the tendency of HIV to mutate, drug resistance is a common issue when attempting to treat those infected with HIV. Protease inhibitors are a class of drugs that HIV is known to develop resistance via mutations. Thus, by providing a collection of protease sequences known to be resistant to one or more drugs, this dataset provides a significant collection of data that could be utilized to perform computational analysis of protease resistance mutations. Discussion of Biases: Due to the sampling nature of this database, it is predominantly composed genes from "well studied" genomes. This may impact the "broadness" of the genes contained. ## Additional Information: - Dataset Curators: Will Dampier - Citation Information: TBA