# BioConceptVec:
creating and evaluating literature-based biomedical concept embeddings on a large scale

[![HitCount](http://hits.dwyl.com/ncbi-nlp/BioConceptVec.svg)](http://hits.dwyl.com/ncbi-nlp/BioConceptVec)

## Table of contents

* [Text corpora](#text-corpora)
* [Named Entity Recognition (NER) tools](#using-pubtator-for-annotating-concepts-in-pubmed)
* [BioConceptVec: embeddings and concept files](#bioconceptvec-embeddings-and-concept-files)
* [Tutorial](#tutorial)
* [Datasets](#datasets)
* [References](#references)
* [Acknowledgments](#acknowledgments)

## Text corpora

We created BioConceptVec using the entire [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/). The texts were split and tokenized using [NLTK](https://www.nltk.org/). We also lowercased all the words.

## Using PubTator for annotating concepts in PubMed

We employed [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator/) to annotate biomedical concepts in PubMed. It covers genes, mutations, chemicals, diseases and cell lines. The trained embeddings contain over 400,000 concepts.

## BioConceptVec: embeddings and concept files

We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we make available both the **embedding** (which contains concepts and other words) in binary format and the **concept-only** file in JSON format.

1. **BioConceptVec cbow:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_word2vec_cbow.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_cbow.json) (798MB).
2. **BioConceptVec skip-gram:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_word2vec_skipgram.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_skip.json) (812MB).
3. **BioConceptVec glove:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_glove.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_glove.json) (835MB).
4. **BioConceptVec fastText:** [embedding](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/bioconceptvec_fasttext.bin) (2.4GB) and [concept-only](https://ftp.ncbi.nlm.nih.gov/pub/lu/BioConceptVec/concept_fast.json) (813MB).

## Tutorial

For a quick start, see [this tutorial](https://github.com/ncbi-nlp/BioConceptVec/blob/master/bioconcept_tutorial.ipynb) on how to use BioConceptVec (both the embedding and the concept-only files).

## Datasets

We also make all 9 evaluation datasets publicly available. They cover 4 applications:

1. [**Drug-Gene interactions**](https://github.com/ncbi-nlp/BioConceptVec/tree/master/datasets/drug_gene_interactions). The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.
2. [**Gene-Gene interactions**](https://github.com/ncbi-nlp/BioConceptVec/tree/master/datasets/gene_gene_interactions). The 5 gene-gene interaction datasets have the same format as above.
3. [**Protein-Protein interactions**](https://github.com/ncbi-nlp/BioConceptVec/tree/master/datasets/protein_protein_interactions). This contains two datasets: (1) combined: protein-protein interactions created based on STRING combined scores, and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: the query protein ID, (2) subject: the subject protein ID, (3) score: the STRING score, and (4) label: whether the pair is a protein-protein interaction.
4. [**Drug-Drug interactions**](https://github.com/ncbi-nlp/BioConceptVec/tree/master/datasets/drug_drug_interactions). This dataset is from the [SemEval-2013 drug-drug interaction task](https://www.cs.york.ac.uk/semeval-2013/task9/). Please see the details there.

## References

When using our resources, please cite the following paper:

Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019).
[BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale](https://arxiv.org/ftp/arxiv/papers/1912/1912.10846.pdf). To appear in PLOS Computational Biology.

## Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.
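
## Example: querying a concept-only file

The concept-only files described above can be explored without loading the full binary embeddings. The sketch below is a minimal illustration, not the authors' code: it assumes that each concept-only JSON file is a single object mapping a concept ID to a list of floats, and the helper names (`load_concept_vectors`, `cosine_similarity`, `most_similar`) and the toy concept IDs are hypothetical. It runs on a tiny stand-in file so that nothing needs to be downloaded.

```python
import json
import math
import os
import tempfile


def load_concept_vectors(path):
    """Load a concept-only JSON file, assumed to map concept ID -> list of floats."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, query_id, top_n=3):
    """Rank every other concept by cosine similarity to the query concept."""
    query = vectors[query_id]
    scored = [
        (concept_id, cosine_similarity(query, vector))
        for concept_id, vector in vectors.items()
        if concept_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-in for a released file such as concept_cbow.json.
# The IDs and values below are made up for illustration only.
toy_vectors = {
    "Gene_A": [1.0, 0.0, 0.0],
    "Gene_B": [0.9, 0.1, 0.0],
    "Disease_C": [0.0, 1.0, 0.0],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_concepts.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump(toy_vectors, handle)
    vectors = load_concept_vectors(toy_path)

ranking = most_similar(vectors, "Gene_A", top_n=2)
for concept_id, score in ranking:
    print(f"{concept_id}\t{score:.4f}")
```

To try this on a released file, point `load_concept_vectors` at a downloaded file such as `concept_cbow.json`. The concept-only files are roughly 800MB each, so loading one takes noticeable time and memory. The linked tutorial remains the reference for how the authors load and use the files.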