# DataScienceProblems

**Repository Path**: kane02/DataScienceProblems

## Basic Information

- **Project Name**: DataScienceProblems
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-03-05
- **Last Updated**: 2024-03-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Data Science Problems

Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper [Training and Evaluating a Jupyter Notebook Data Science Assistant](https://arxiv.org/abs/2201.12901) for more details about state-of-the-art results and other properties of the dataset.

## Installation

This project requires Python 3.6+ and Docker. Assuming you have these installed, first download and install the Python package:

```
$ git clone git@github.com:microsoft/DataScienceProblems.git
$ cd DataScienceProblems/src
$ pip install -e .
```

## Usage

### Reading the problems

Extract the `juice-github-repos.tar.gz` file from the `DataScienceProblems` repository:

```
$ tar -xvzf juice-github-repos.tar.gz
```

### Data Schema

Here is an example of a notebook context, prompt cell (1 and the markdown), solution cell (2), and unit tests cell (3). The DSP schema corresponding to this example includes the `prompt`, which is the question asked of the student, the `solution`, which is the answer to the question, and the `test`, which is the test case run against the student's code.

```python
{
    'notebook_path': '/path/to/the/notebook.ipynb',
    'notebook_problem_index': 0,
    'prompt': '%matplotlib inline\n'
              'import matplotlib.pyplot as plt\n'
              'import numpy as np\n'
              'import scipy.optimize as opt\n'
              '## Hat potential\n'
              'The following potential is often used in Physics and other fields '
              'to describe symmetry breaking and is often known as the "hat '
              'potential":\n'
              '\n'
              '$$ V(x) = -a x^2 + b x^4 $$\n'
              '\n'
              'Write a function `hat(x,a,b)` that returns the value of this '
              'function:\n',
    'solution': 'def hat(x,a=5.0,b=1.0):\n return -a* x*x + b*x**4',
    'task_id': 'DSP/414',
    'test': 'assert hat(0.0, 1.0, 1.0)==0.0\n'
            'assert hat(0.0, 1.0, 1.0)==0.0\n'
            'assert hat(1.0, 10.0, 1.0)==-9.0'
}
```

We provide a `read_problems` function that reads the problems from the Jupyter notebooks. Below is an example of how to use `read_problems` and save your generated code samples to a file.

```python
from data_science_problems.read import read_problems
from data_science_problems.utils import write_jsonl

problems = read_problems()
num_samples = 1

samples = [
    dict(task_id=task_id, completion=generate_code(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples)
]

write_jsonl("samples.jsonl", samples)
```
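The snippet above assumes you supply your own `generate_code` function; it is not part of the `data_science_problems` package. A minimal sketch of what such a function might look like, assuming a Hugging Face `transformers` text-generation pipeline (the model name and decoding parameters below are purely illustrative):

```python
# Hypothetical completion function -- swap in whatever code generation model
# you are evaluating; only the samples.jsonl format matters to the container.
from transformers import pipeline

# Illustrative model choice; any causal LM that completes code will do.
generator = pipeline("text-generation", model="Salesforce/codegen-350M-mono")

def generate_code(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a completion for a DSP prompt and return only the new code."""
    output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
    )
    completion = output[0]["generated_text"]
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return completion[len(prompt):]
```

Any function that maps a prompt string to a completion string will work here, since the evaluation container only consumes the resulting `samples.jsonl` file.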
### Executing the problems and unit tests

Once you have saved the generated samples in the `samples.jsonl` file, build the provided Docker container, which lets you run the generated samples safely in isolation. Use the following command to build the container:

```
$ docker build --pull --rm -f "Dockerfile" -t datascienceproblems:latest "."
```

Once the Docker container is built, you can execute the generated samples inside it. You'll need to map the `/app/juice-github-repos` directory and the `/samples/samples.jsonl` file to the host paths where the notebooks and samples are stored. Use the following command to execute the samples inside the container:

```
$ docker run -it --rm -v $PWD/juice-github-repos:/app/juice-github-repos -v $PWD/samples.jsonl:/samples/samples.jsonl datascienceproblems /samples/samples.jsonl
```

The `docker run` command performs the following steps:

- It reads the samples from the `samples.jsonl` file.
- It creates new notebooks containing the generated code samples. The list of new notebooks is saved in the `generates-notebooks.txt` file.
- It executes these new notebooks.
- It computes `pass@k` for the generated samples.

> **WARNING: Running the `docker run` command with `num_samples = 1` will create ~1000 new notebooks and save them to your disk. This may take a while.**

```
$ docker run -it --rm -v $PWD/juice-github-repos:/app/juice-github-repos -v $PWD/samples.jsonl:/samples/samples.jsonl datascienceproblems /samples/samples.jsonl
2021-11-02 09:11:11,847 INFO services.py:1164 -- View the Ray dashboard at http://127.0.0.1:8265
Reading the generated samples.
100%|███████████████████████████████████████████████| 305/305 [00:03<00:00, 97.34it/s]
Saving to new notebooks with generated samples.
100%|███████████████████████████████████████████████| 305/305 [00:36<00:00, 8.47it/s]
Execute the new notebooks with generated samples.
100%|███████████████████████████████████████████████| 2192/2192 [05:17<00:40, 9.49it/s]
Compute pass@k for the executed notebooks.
100%|███████████████████████████████████████████████| 2192/2192 [00:28<00:00, 76.73it/s]
{'pass@1': ..., 'pass@10': ...}
```
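The container reports `pass@1`, `pass@10`, and so on for you, so the following is only for reference. A minimal sketch of the unbiased `pass@k` estimator popularized by HumanEval-style benchmarks (whether DSP's container uses exactly this implementation is an assumption here): given `n` generated samples for a task, of which `c` pass the unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single task.

    n: total number of samples generated for the task
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset of the n samples contains at least one passing sample.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for a task, 3 of which pass the tests.
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0
```

The benchmark-level score is then the mean of this per-task quantity over all tasks.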
##### Trademarks

> This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.

## Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
|---|---|
| name | Data Science Problems | 
| url | https://github.com/microsoft/DataScienceProblems | 
| sameAs | https://github.com/microsoft/DataScienceProblems | 
| description | Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state-of-the-art results and other properties of the dataset. |
| citation | https://arxiv.org/abs/2201.12901 | 
| license | https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt |