# scrap-science

A collection of Jupyter notebooks to scrape some of the most popular web platforms for scientific papers.

## Setup

* Python environment
* Chrome driver
* pdf2txt

## arXiv

1. Get all search-result URLs and their corresponding number of search results.
2. Feed these into scrap_arvix.ipynb to get a clean, duplicate-free list of URLs of individual papers.
3. Download all PDFs (see the deduplication/download sketch at the end of this README).
4. Scrape the data.
5. Manual quality check.

## bioRxiv

1. Get all search-result URLs and their corresponding number of search results.
2. Feed these into scrap_biorxiv.ipynb to get a clean, duplicate-free list of URLs of individual papers.
3. Scrape; this step also downloads the PDFs (see the Selenium sketch below).
4. Manual quality check.

## PubMed

1. Search PubMed and download the results as .csv files into the raw_result folder.
2. Use scrap_pubmed.ipynb to combine all CSVs, remove duplicates, and scrape (no PDFs; see the CSV-merging sketch below).
3. Save as .csv and do a manual quality check of the search results.

## MICCAI

1. Get the proceedings content PDFs from Springer:
   * 2014 and 2015: collect the URLs manually.
   * 2016 and 2017: the PDFs contain the URLs.
2. Run getMiccaiUrls.py to extract the URLs from the PDFs and dump them as a list in a .npy file.
3. Read these in scrap_miccai.ipynb and add the hard-coded 2014 and 2015 URLs (see the .npy sketch below).
4. Run scrap_miccai.ipynb (no PDFs).

## IEEE

1. Go to http://ieeexplore.ieee.org/Xplore/home.jsp.
2. Enter keywords and download the .csv; the link to the search will be in the first row.
3. Combine and clean the multiple downloaded CSVs using combine_Ieee.ipynb. This produces a single ieee.csv without duplicates.
4. Run scrap_ieee.ipynb. It first downloads as many PDFs as it can, then loops through the CSV and converts PDF to text (pdf2txt) to extract emails (see the email-extraction sketch below).
5. Manual cleanup.
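## Example snippets

The snippets below are rough, illustrative sketches of individual steps, not the code from the notebooks; file names, URLs, selectors, and column names are assumptions unless stated otherwise.

A minimal sketch of the arXiv deduplication and PDF-download steps, assuming a Python list of abstract URLs as input:

```python
# Illustrative sketch (not the notebook code): deduplicate a list of arXiv
# abstract URLs and download the corresponding PDFs with `requests`.
# The input list and output folder are assumptions for the example.
import os
import requests

abs_urls = [
    "https://arxiv.org/abs/2005.12345",
    "https://arxiv.org/abs/2005.12345",   # duplicate on purpose
    "https://arxiv.org/abs/1906.04321",
]

# Order-preserving deduplication.
unique_urls = list(dict.fromkeys(abs_urls))

os.makedirs("pdfs", exist_ok=True)
for url in unique_urls:
    arxiv_id = url.rstrip("/").split("/")[-1]
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()
    with open(os.path.join("pdfs", f"{arxiv_id}.pdf"), "wb") as fh:
        fh.write(response.content)
```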
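The setup list mentions a Chrome driver, which suggests the bioRxiv result pages are rendered with Selenium before scraping. A minimal sketch under that assumption; the search URL and CSS selector are placeholders, not bioRxiv's actual markup:

```python
# Illustrative sketch only: render a search-result page with Selenium and
# collect links to individual papers. "a.paper-link" is a placeholder selector.
from selenium import webdriver
from selenium.webdriver.common.by import By

search_url = "https://www.biorxiv.org/search/deep+learning"  # example query

driver = webdriver.Chrome()   # requires a matching chromedriver to be available
try:
    driver.get(search_url)
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a.paper-link")]
    unique_links = list(dict.fromkeys(links))   # remove duplicates, keep order
    print(f"{len(unique_links)} unique paper URLs")
finally:
    driver.quit()
```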
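For the PubMed step that combines the exported CSVs and removes duplicates, a pandas sketch; the raw_result/ folder comes from the README, while the "PMID" column name is an assumption about the export format:

```python
# Illustrative sketch: merge all PubMed CSV exports from raw_result/ and
# drop duplicate records before scraping. Check the column names in your files.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("raw_result/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Deduplicate on the PubMed identifier and save the cleaned table.
combined = combined.drop_duplicates(subset="PMID")
combined.to_csv("pubmed.csv", index=False)
```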
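For MICCAI, a sketch of the kind of thing getMiccaiUrls.py does (this is not its actual code): extract URLs from a Springer contents PDF with a regex, save them as a .npy file, then read them back in the notebook and append the manually collected 2014/2015 URLs. The pdfminer.six dependency and the file names are assumptions:

```python
# Illustrative sketch only (not the actual getMiccaiUrls.py).
import re
import numpy as np
from pdfminer.high_level import extract_text   # from pdfminer.six

# Pull every URL out of the contents PDF and store them as a .npy list.
text = extract_text("miccai_2016_contents.pdf")        # example file name
urls = re.findall(r"https?://\S+", text)
np.save("miccai_urls.npy", np.array(urls, dtype=object))

# Later, in the notebook: load the list and add the hard-coded 2014/2015 URLs.
urls = list(np.load("miccai_urls.npy", allow_pickle=True))
urls += ["https://link.springer.com/chapter/..."]       # placeholder entries
urls = list(dict.fromkeys(urls))                         # drop duplicates
```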
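For the IEEE step that converts PDFs to text and pulls out emails, a sketch using pdfminer.six's Python API (the package behind the pdf2txt command) plus a simple email regex; the ieee_pdfs/ directory is an assumption:

```python
# Illustrative sketch: convert each downloaded PDF to text and collect
# author email addresses with a regex.
import glob
import re
from pdfminer.high_level import extract_text

email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

emails_per_paper = {}
for pdf_path in glob.glob("ieee_pdfs/*.pdf"):
    text = extract_text(pdf_path)
    emails_per_paper[pdf_path] = sorted(set(email_pattern.findall(text)))

for path, emails in emails_per_paper.items():
    print(path, emails)
```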