# dclm

**Repository Path**: simon1239/dclm

## Basic Information

- **Project Name**: dclm
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: Mivg-patch-1
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-24
- **Last Updated**: 2024-07-24

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# DataComp-LM (DCLM)

**Deprecation notice**

This repository is now deprecated. For the most up-to-date code, data, and documentation, please visit [https://github.com/mlfoundations/dclm](https://github.com/mlfoundations/dclm).

## Table of Contents

1. [Introduction](#introduction)
2. [Leaderboard](#leaderboard)
3. [Getting Started](#getting-started)
4. [Selecting Raw Sources](#selecting-raw-sources)
5. [Processing the Data](#processing-the-data)
6. [Tokenize and Shuffle](#tokenize-and-shuffle)
7. [Model Training](#model-training)
8. [Evaluation](#evaluation)
9. [Submission](#submission)
10. [Contributing](#contributing)
11. [How to Cite Us](#how-to-cite-us)
12. [License](#license)

## Introduction

[DataComp-LM (DCLM)](https://datacomp.ai/dclm/) is a comprehensive framework designed for building and training large language models (LLMs) with diverse datasets. It offers a standardized corpus of over 300T unfiltered tokens from CommonCrawl, effective pretraining recipes based on the open_lm framework, and an extensive suite of over 50 evaluations. This repository provides tools and guidelines for processing raw data, tokenizing, shuffling, training models, and evaluating their performance.

DCLM enables researchers to experiment with various dataset construction strategies across different compute scales, from 411M to 7B parameter models. Our baseline experiments show significant improvements in model performance through optimized dataset design. DCLM has already enabled the creation of several high-quality datasets that perform well across scales and outperform all open datasets.
*Figure: Developing datasets for better models that are cheaper to train. Using DataComp-LM, we develop a high-quality dataset, DCLM-BASELINE, which we use to train models with strong compute-performance tradeoffs. We compare on both a Core set of tasks (left) and on MMLU 5-shot (right). DCLM-BASELINE (orange) shows favorable performance relative to both closed-source models (crosses) and other open-source datasets and models (circles).*
**Submission workflow**:

* **(A)** A participant chooses a scale, where larger scales reflect more target training tokens and/or model parameters. The smallest scale is 400m-1x, a 400m parameter model trained compute-optimally (1x), and the largest scale is 7B-2x, a 7B parameter model trained with twice the tokens required for compute optimality.
* **(B)** A participant filters a pool of data (filtering track) or mixes data of their own (bring your own data track) to create a dataset.
* **(C)** Using the curated dataset, a participant trains a language model, with standardized training code and scale-specific hyperparameters, which is then
* **(D)** evaluated on 53 downstream tasks to judge dataset quality.

For more details, please refer to our [paper](https://placeholder-link-to-paper.com).

## Leaderboard

The DCLM [leaderboard](https://datacomp.ai/dclm/leaderboard) showcases the performance of models trained on various scales and datasets. The leaderboard is updated regularly with the latest submissions from the community.

Below are comparisons of our model with others in the 7B regime.

| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---------------|--------|--------|---------------|----------|----------|----------|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | **71.9** | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ✗ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ✗ | **61.0** | 69.9 | **57.9** |
| **Open weights, open datasets** | | | | | | |
| Falcon | 7B | 1T | ✓ | 44.1 | 27.4 | 25.1 |
| OLMo-1.7 | 7B | 2.1T | ✓ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✓ | **50.2** | **57.1** | **40.4** |
| **Models we trained** | | | | | | |
| FineWeb edu | 7B | 0.14T | ✓ | 38.7 | 26.3 | 22.1 |
| FineWeb edu | 7B | 0.28T | ✓ | 41.9 | 37.3 | 24.5 |
| **DCLM-BASELINE** | 7B | 0.14T | ✓ | 44.1 | 38.3 | 25.0 |
| **DCLM-BASELINE** | 7B | 0.28T | ✓ | 48.9 | 50.8 | 31.8 |
| **DCLM-BASELINE** | 7B | 2.6T | ✓ | **57.1** | **63.7** | **45.4** |

## Getting Started

To get started with DCLM, follow these steps:

1. **Clone the repository**:

    ```bash
    git clone https://github.com/mlfoundations/DCLM.git
    cd DCLM
    ```

2. **Install dependencies**:

    ```bash
    pip install -r requirements.txt
    ```

    Before installing the dependencies, make sure cmake, build-essential, and g++ are installed, e.g., by running:

    ```bash
    apt install cmake build-essential
    apt install g++-9
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
    ```

3. **Set up your environment**: DCLM uses AWS for storage (and optionally as a compute backend) and Ray for distributed processing. Ensure you have the necessary environment variables and configurations for AWS and Ray clusters; a minimal sketch of a typical setup follows this list. We recommend using Python 3.10 with DCLM.
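Which environment variables are "necessary" depends on your AWS account and cluster setup; the snippet below is only a minimal sketch of a common configuration, assuming the standard AWS credential variables and the AWS CLI. The values and bucket name are placeholders, and none of this is mandated by DCLM itself.

```bash
# Minimal environment sketch (assumption: standard AWS credential variables;
# the values and region below are placeholders, not DCLM requirements).
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-west-2"

# Sanity-check that the credentials are picked up before launching anything.
aws sts get-caller-identity

# Confirm you can reach the S3 bucket you plan to read from / write to
# (the bucket name is a placeholder).
aws s3 ls s3://<your-bucket>/
```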
## Selecting Raw Sources

If you are creating a new source:

- Ensure your data is stored in JSONL format (ideally compressed with zstandard).
- Key names should be consistent with those defined [here](baselines/core/constants.py).
- Create a reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources).

If you are selecting a raw source for downstream processing:

- Identify the raw source you intend to use, which corresponds to a dataset reference (i.e., a JSON in [raw_sources](exp_data/datasets/raw_sources)).
- The reference JSON contains the URL to the actual data and other metadata used as input for downstream processing.

## Processing the Data

To process raw data, follow these steps:

1. **Define a set of processing steps**: Create a pipeline config YAML file specifying the operations. See our [reproduction of C4](baselines/baselines_configs/c4.yaml) for an example. Further details on defining a pipeline can be found [here](baselines/README.md).

2. **Launch a Ray cluster**: Use an appropriate Ray cluster based on the size of your dataset and specific YAML configurations; a sketch of the standard Ray cluster-launcher commands follows this list.

3. **Run the processing script**:

    ```bash
    ray attach
    ```
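Steps 2–3 rely on the standard Ray cluster launcher. The commands below are a sketch of that generic workflow, assuming a cluster config file you write yourself (the name `ray_cluster.yaml` is a placeholder and is not shipped with this repository); the repository-specific processing command to run on the head node is not reproduced here.

```bash
# Generic Ray cluster-launcher workflow; ray_cluster.yaml is a placeholder
# for a cluster config describing your own AWS nodes.

# Start (or update) the cluster defined by the config.
ray up ray_cluster.yaml

# Open a shell on the head node, from which the processing script is launched.
ray attach ray_cluster.yaml

# Tear the cluster down when processing is done to avoid idle costs.
ray down ray_cluster.yaml
```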