# s2_fos

Code for scoring, training, and data generation for Semantic Scholar's Field of Study (FoS) classifier model. The classifier is a fine-tuned SciBERT model that predicts the field of study for a given paper.

During inference:

- The language of the paper is determined first.
- If the paper is in English, the Field of Study categories and their corresponding scores are predicted.

## Installation

To install this package with `poetry`, run the following commands:

```bash
git clone https://github.com/allenai/s2_fos.git
cd s2_fos

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -
poetry install

# Activate the virtual environment
poetry shell

# fasttext does not comply with PEP 518,
# so it has to be installed manually
pip install fasttext
```

If you encounter problems installing Poetry, please refer to the [Poetry documentation](https://python-poetry.org/docs/#installation).

Alternatively, you can install with anaconda:

```bash
git clone https://github.com/allenai/s2_fos.git
cd s2_fos
conda create -y --name s2fos python==3.8
conda activate s2fos
pip install -e .
pip install fasttext
```

To obtain the necessary data, run these commands after the package is installed:

```bash
# Download the language identification model from fasttext
cd data && wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
```

## Hugging Face Artifacts

Model weights, training data, and annotations are available on Hugging Face under the [ImpACT License Low Risk](https://allenai.org/licenses/impact-lr).

Model weights and config can be found under:
[allenai/scibert_scivocab_uncased_field_of_study](https://huggingface.co/allenai/scibert_scivocab_uncased_field_of_study)

Training data, annotations, and OpenAI responses can be found under:
[allenai/fos_model_training_data_open_ai_annotations](https://huggingface.co/datasets/allenai/fos_model_training_data_open_ai_annotations)

To access the model weights and training data:

- If you don't have one, create a [Hugging Face account](https://huggingface.co).
- If you don't have one, generate [a Hugging Face token](https://huggingface.co/settings/tokens).
- While logged in, click on both of the model weights and data links above and accept the license agreements.

Set your Hugging Face token as an environment variable with the following command (replace the empty value with your actual token):

```bash
export HUGGINGFACE_HUB_TOKEN=
```

Or set the token in Python before importing from `s2_fos`:

```python
import os
os.environ['HUGGINGFACE_HUB_TOKEN'] = ''
```

## Inference Example Code

```python
from s2_fos import S2FOS

# Example paper data.
# Note that 'journal_name' is just a convention - it can be any venue, such as NeurIPS, arXiv, etc.
papers = [{
    'title': "A Prototype-Based Few-Shot Named Entity Recognition",
    'abstract': ("Few-shot Named Entity Recognition (NER) task focuses on identifying named entities with "
                 "a small amount of supervised training data. Work based on prototype networks shows "
                 "strong adaptability for the few-shot NER task. We believe that the core idea of these "
                 "approaches is to learn how to aggregate the representation of token mappings in vector "
                 "space around entity classes. However, to our knowledge, no work has investigated its "
                 "effect. Therefore, we propose ClusLoss and ProEuroLoss, aiming to enhance the model's "
                 "ability to aggregate semantic information spatially, thus helping the model "
                 "better distinguish between entity types. Experimental results show that ProEuroLoss achieves "
                 "state-of-the-art performance on average F1 scores for both 1-shot and 5-shot NER "
                 "tasks, while ClusLoss has competitive performance in such tasks."),
    'journal_name': "Proceedings of the 8th International Conference on Computing and Artificial Intelligence",
}]

# Initialize the predictor
predictor = S2FOS()

# Predict the fields of study
print(predictor.predict(papers))
```

## Development

To run the tests with poetry, execute the following commands:

```bash
poetry shell
poetry run pytest
```

Or with anaconda:

```bash
pip install pytest
pytest test
```

## Training

The Python file `train_net.py` contains the model fine-tuning code. Training can also run on a CPU, but it will be very slow, so an appropriate GPU machine is recommended.

First, edit the `src/s2_fos/training/run.sh` script, replacing its placeholder values with values appropriate for your setup. Then run the training:

```bash
cd src/s2_fos/training
poetry shell
bash run.sh
```

Training data is downloaded from Hugging Face under the [ImpACT License Low Risk](https://allenai.org/licenses/impact-lr) into the `~/.cache/huggingface/hub/datasets--allenai--fos_model_training_data_open_ai_annotations` folder. It is automatically split into train, test, and validation sets with ratios of 0.7/0.15/0.15. If you want to provide training data manually, use the `--train_data_path`, `--test_data_path`, and `--validation_data_path` parameters.

## Calling the OpenAI API to Generate Training Data

To call the OpenAI API, set the `OPENAI_API_KEY` environment variable to your API key. An example script is located at `src/s2_fos/training/open_ai_prompts.py`. Run it with the following commands:

```bash
poetry shell
poetry run python src/s2_fos/training/open_ai_prompts.py
```

This script reads data from `data/paper_title_abstract_example.json` and writes the results to `data/paper_title_abstract_example_openai.json`. The OpenAI prompt configuration is defined within `src/s2_fos/training/open_ai_prompts.py`.
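
## Loading the Model Weights Directly

If you want to experiment with the fine-tuned weights outside the `S2FOS` wrapper, the sketch below loads them with the `transformers` library. The model id comes from the Hugging Face section above; everything else - the sequence-classification head, the multi-label (sigmoid) scoring, the `id2label` mapping, and the way title and abstract are joined - is an assumption about the checkpoint, not something this README guarantees.

```python
# Minimal sketch: load the fine-tuned checkpoint directly with transformers.
# Assumptions (not guaranteed by this repo): the checkpoint carries a
# sequence-classification head, FoS scoring is multi-label (sigmoid), and
# human-readable label names are stored in model.config.id2label.
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "allenai/scibert_scivocab_uncased_field_of_study"
hf_token = os.environ["HUGGINGFACE_HUB_TOKEN"]  # set as described above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=hf_token)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, token=hf_token)
model.eval()

title = "A Prototype-Based Few-Shot Named Entity Recognition"
abstract = "Few-shot Named Entity Recognition (NER) task focuses on ..."

# Assumed input format: title and abstract joined by the tokenizer's separator
# token; the actual S2FOS preprocessing may differ.
text = f"{title} {tokenizer.sep_token} {abstract}"
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]

# One sigmoid score per field of study (multi-label assumption).
for idx, score in enumerate(torch.sigmoid(logits).tolist()):
    print(f"{model.config.id2label.get(idx, str(idx))}: {score:.3f}")
```

For real use, prefer the `S2FOS` predictor shown earlier: it also runs the fastText language check, so non-English papers are handled as the model expects.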