# chase

**Repository Path**: yalongfan/chase

## Basic Information

- **Project Name**: chase
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-09-26
- **Last Updated**: 2021-11-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL

CHASE is a large-scale and pragmatic Chinese dataset for the cross-database context-dependent text-to-SQL task (natural language interfaces for relational databases). It is released along with our ACL 2021 paper, "CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL". This repo contains the CHASE dataset.

### Citation


### Data Content and Format

#### Question, SQL, and Parsed SQL

Each entry in `train.json` and `dev.json` contains the following fields:
- `database_id`: the id of the database that this interaction addresses.
- `interaction`: the sequence of questions asked against that database. Each question in the interaction includes:
  - `utterance`: the natural language question.
  - `utterance_toks`: the tokens of the natural language question.
  - `query`: the SQL query corresponding to the question.
  - `sql`: the parsed form of this SQL query, produced by `process_sql.py`. Please refer to the [Spider Github page](https://github.com/taoyds/spider) for the detailed documentation.

```
    {
        "database_id": "party_host",
        "interaction": [
            {
                "utterance": "主办方都有谁？",
                "utterance_toks": [
                    "主",
                    "办",
                    "方",
                    ...
                    "？"
                ],
                "query": "select 姓名 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "他们来自哪些不同的国家？",
                "utterance_toks": [
                    "他",
                    "们",
                    ...
                    "？"
                ],
                "query": "select distinct 国籍 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "每个国家有多少个主办方？",
                "utterance_toks": [
                    "每",
                    "个",
                    "国",
                    "家",
                    ...
                    "？"
                ],
                "query": "select 国籍 , count(*) from 主办方 group by 国籍",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            }
        ]
    }
```
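The fields above can be walked turn by turn. The following is a minimal sketch, assuming each file is a top-level JSON array of entries shaped like the example; the helper names `load_interactions` and `iter_turns` are hypothetical and not part of this repo.

```python
import json
from typing import Iterator, List, Tuple


def load_interactions(path: str) -> List[dict]:
    """Load train.json or dev.json.

    Assumption: the file is a JSON array of interaction entries.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_turns(interactions: List[dict]) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (database_id, turn_index, utterance, query) for every question."""
    for item in interactions:
        for turn_index, turn in enumerate(item["interaction"]):
            yield item["database_id"], turn_index, turn["utterance"], turn["query"]


# In-memory sample built from the first two turns of the example above,
# so the sketch runs without the dataset files.
sample = [
    {
        "database_id": "party_host",
        "interaction": [
            {"utterance": "主办方都有谁？", "query": "select 姓名 from 主办方"},
            {
                "utterance": "他们来自哪些不同的国家？",
                "query": "select distinct 国籍 from 主办方",
            },
        ],
    }
]

turns = list(iter_turns(sample))
for db_id, idx, utterance, query in turns:
    print(db_id, idx, utterance, "->", query)
```

With the real data, `iter_turns(load_interactions("train.json"))` would be the equivalent call, provided the array assumption holds. Keeping the turn index is useful because later questions in an interaction (such as "他们来自哪些不同的国家？") depend on the earlier ones.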

#### Tables

`tables.json` contains the following information for each database:
- `db_id`: database id
- `table_names_original`: original table names stored in the database.
- `table_names`: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
- `column_names_original`: original column names stored in the database. Each column looks like `[0, "派对主题"]`: `0` is the index of the table in `table_names` (`"派对"` in this case), and `"派对主题"` is the column name.
- `column_names`: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
- `column_types`: data type of each column
- `foreign_keys`: foreign keys in the database. Each pair, such as `[12, 7]`, holds two column indices into `column_names`; the two columns belong to different tables and are linked by a foreign key.
- `primary_keys`: primary keys in the database. Each number is an index into `column_names`.

```
    {
        "db_id": "party_host",
        "table_names_original": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "table_names": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "column_names_original": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_names": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_types": [
            "text",
            "number",
            "text",
            "text",
            ...
        ],
        "foreign_keys": [
            [
                11,
                1
            ],
            [
                12,
                7
            ]
        ],
        "primary_keys": [
            1,
            7,
            11
        ]
    }
```
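Because keys are stored as indices into the column list, it can help to resolve them to `table.column` names. The following is a minimal sketch: the helper names and the small English toy schema are hypothetical, chosen only to mirror the layout described above, and the code pairs foreign-key columns in the stored order without assuming which side is the referenced one.

```python
from typing import List, Tuple


def qualified_columns(schema: dict) -> List[str]:
    """Return one 'table.column' string per entry of column_names_original."""
    names = []
    for table_idx, column in schema["column_names_original"]:
        if table_idx < 0:
            # The special "*" column is not attached to any table.
            names.append(column)
        else:
            table = schema["table_names_original"][table_idx]
            names.append(f"{table}.{column}")
    return names


def describe_keys(schema: dict) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Resolve primary_keys and foreign_keys indices to qualified names."""
    columns = qualified_columns(schema)
    primary = [columns[i] for i in schema["primary_keys"]]
    foreign = [(columns[a], columns[b]) for a, b in schema["foreign_keys"]]
    return primary, foreign


# Hypothetical toy schema (not taken from tables.json).
toy_schema = {
    "db_id": "toy_library",
    "table_names_original": ["author", "book"],
    "column_names_original": [
        [-1, "*"],
        [0, "author_id"],
        [0, "name"],
        [1, "book_id"],
        [1, "author_id"],
    ],
    "foreign_keys": [[4, 1]],
    "primary_keys": [1, 3],
}

primary, foreign = describe_keys(toy_schema)
print("primary keys:", primary)
print("foreign keys:", foreign)
```

The same functions should apply to an entry of `tables.json` as long as it follows the field layout shown in the example.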