# pubmedextract

**Repository Path**: mirrors_allenai/pubmedextract

## Basic Information

- **Project Name**: pubmedextract
- **Description**: extracting demographics from tables in pubmed papers
- **License**: Apache-2.0
- **Default Branch**: master

## Statistics

- **Created**: 2020-09-24
- **Last Updated**: 2026-05-17

## README

# PubMed-Extract

Quantifying demographic bias in clinical trials using a corpus of academic papers.

This repo accompanies the paper: [Quantifying Sex Bias in Clinical Studies at Scale with Automated Data Extraction](https://www.ncbi.nlm.nih.gov/pubmed/31268541) by Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, and Oren Etzioni.

This code takes as input a clinical trial paper parsed by Omnipage and returns the extracted number of participating women and men. This package is being released for (a) algorithmic documentation and (b) statistical analysis reproduction purposes, and will not work for extracting clinical trial participant counts from new PDFs you may have as it depends on Omnipage. For an example of the type of input that PubMed-Extract expects, see `tests/test_sex/papers/`.

- The code for (a) is located in `pubmedextract/`.
- The code for (b) is located in `analysis_scripts/`.

Note that the paper compares PubMed-Extract to an algorithm referred to as AACT-Query. This is a relatively simple algorithm and its execution (essentially a SQL query) is contained entirely within `analysis_scripts/04_analysis.py`.

## Installation

This project requires **Python 3.6**.
We recommend you set up a conda environment:

```
conda create -n pubmedextract python=3.6
conda activate pubmedextract
conda install spacy=1.9 thinc=6.5.2 matplotlib=3.0.1 seaborn=0.9.0 joblib=0.12.0 psycopg2=2.7.5 pandas=0.23.4 statsmodels=0.9.0 patsy pytest pylint
```

Depending on your `anaconda` version, you may need to run `source activate pubmedextract` instead of `conda activate pubmedextract`.

Then clone the repo and install it (along with the remaining requirements):

```
git clone https://github.com/allenai/pubmedextract.git
cd pubmedextract
python setup.py install
```

## Tests

After installing, you can run the linter and all the unit tests:

```
pylint --disable=R,C,W pubmedextract
python -m pytest tests/
```

## Usage Example: Extracting Gender Counts from Available JSON Inputs

A simple example is in `scripts/parse_paper_example.py`. It is reproduced in its entirety below:

```
import pickle
from pubmedextract.sex import get_sex_counts
from pubmedextract.table_utils import PaperTable

# load some example papers
# assumes the cwd is pubmedextract/
with open('tests/test_sex/test_papers_and_counts.pickle', 'rb') as f:
    s2ids_and_true_counts, _ = pickle.load(f)

# get the counts and print them out
for s2id, true_counts in s2ids_and_true_counts:
    paper = PaperTable(s2id, 'tests/test_sex/papers/')
    demographic_info = get_sex_counts(paper)
    print('True counts:', true_counts)
    print('Estimated counts:', demographic_info.counts_dict, '\n')
```

## Paper Analysis Reproduction

The scripts needed to reproduce the analyses in the paper `Quantifying Sex Bias in Clinical Studies at Scale with Automated Data Extraction` can be found here: https://github.com/allenai/pubmedextract/tree/master/analysis_scripts.
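
As a follow-up to the usage example above, the extracted counts can be turned into a female participation fraction, which is the kind of quantity the paper analyzes. The sketch below is only illustrative: `female_fraction` is a hypothetical helper that is not part of `pubmedextract`, and the key names `"females"` and `"males"` are assumptions about the layout of `demographic_info.counts_dict`. Check the actual keys in your output and pass them in if they differ.

```python
from typing import Mapping, Optional


def female_fraction(
    counts: Mapping[str, Optional[int]],
    female_key: str = "females",  # assumed key name, not confirmed by the repo
    male_key: str = "males",      # assumed key name, not confirmed by the repo
) -> Optional[float]:
    """Return females / (females + males), or None if it cannot be computed."""
    females = counts.get(female_key)
    males = counts.get(male_key)
    # A missing or None count means extraction did not yield a usable value.
    if females is None or males is None:
        return None
    total = females + males
    # Avoid dividing by zero when no participants were counted.
    if total <= 0:
        return None
    return females / total


# Hypothetical stand-in for `demographic_info.counts_dict`.
example_counts = {"females": 30, "males": 70}
print(female_fraction(example_counts))
```

Returning `None` rather than raising keeps papers with incomplete counts easy to filter out before aggregating across a corpus.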