# CodeFeedback-Filtered-Instruction

**Repository Path**: hf-datasets/CodeFeedback-Filtered-Instruction

## Basic Information

- **Project Name**: CodeFeedback-Filtered-Instruction
- **Description**: Mirror of https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction
- **Default Branch**: main
- **Created**: 2024-04-17
- **Last Updated**: 2024-06-09

## README

---
language:
- en
pipeline_tag: text-generation
tags:
- code
license: apache-2.0
task_categories:
- question-answering
size_categories:
- 10K
---

# OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

[🏠Homepage] | [🛠️Code]

## OpenCodeInterpreter

OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It advances code generation by integrating execution and iterative refinement.

For further information and related work, refer to our paper, ["OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement"](https://arxiv.org/abs/2402.14658), available on arXiv.

## Dataset Description

CodeFeedback-Filtered-Instruction is a curated collection of code instruction queries extracted from four prominent open-source code instruction tuning datasets:

- [Magicoder-OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
- [Python code subset of ShareGPT](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT)
- [Magicoder-Evol-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K)
- [Evol-Instruct-Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1)

Initially, 287k queries were aggregated from these datasets. To isolate the most complex and informative instructions, we filtered them with Qwen-72B-Chat, an open-source chat model. The LLM evaluated each code query together with its corresponding response from the source datasets and assigned a complexity score from 1 to 5; only queries rated 4 or 5 were retained for the seed set. This filtering yielded a final collection of 156k high-quality single-turn code instructions.

In the subsequent processing steps described in the paper, apart from Single-turn Packing, we used only the queries and disregarded the responses. Here, however, we retain all responses to make the dataset more convenient to use.
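The complexity-based filtering described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the record keys `query` and `answer`, the function `filter_by_complexity`, and the `toy_score` heuristic are all hypothetical names introduced here for illustration. In the real process the score comes from Qwen-72B-Chat; the sketch swaps in a simple word-count heuristic so that it runs without a model.

```python
from typing import Callable, Iterable

# Hypothetical record layout: each item pairs a query with its response.
Record = dict[str, str]


def filter_by_complexity(
    records: Iterable[Record],
    score_fn: Callable[[str, str], int],
    min_score: int = 4,
) -> list[Record]:
    """Keep records whose complexity score (1 to 5) is at least min_score."""
    kept = []
    for record in records:
        # The judge sees the query together with its response.
        score = score_fn(record["query"], record["answer"])
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        if score >= min_score:
            kept.append(record)
    return kept


def toy_score(query: str, answer: str) -> int:
    """Stand-in for the LLM judge (assumption): more words, higher score."""
    words = len(query.split()) + len(answer.split())
    return max(1, min(5, 1 + words // 5))


samples = [
    {"query": "Print hello.", "answer": "print('hello')"},
    {
        "query": "Implement an LRU cache with O(1) get and put, "
                 "and explain the eviction policy.",
        "answer": "Use a hash map plus a doubly linked list; "
                  "evict from the tail when capacity is exceeded.",
    },
]

# Only the second sample reaches the threshold of 4 under the toy heuristic.
retained = filter_by_complexity(samples, toy_score)
print(len(retained))
```

Passing the scorer as a parameter keeps the threshold logic separate from the judge, so an actual model call could replace `toy_score` without changing the filter.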
## Contact

If you have any inquiries, please feel free to raise an issue or reach out to us via email at: xiangyue.work@gmail.com, zhengtianyu0428@gmail.com. We're here to assist you!

⚠️ The dataset contains some data generated by OpenAI's language models. Please pay attention to OpenAI's usage policy when adopting this dataset: https://openai.com/policies/usage-policies.