# vip-chatbot

**Repository Path**: greitzmann/vip-chatbot

## Basic Information

- **Project Name**: vip-chatbot
- **Description**: 任务型对话系统（Task-based Dialogue System）
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-11-09
- **Last Updated**: 2021-11-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# 任务型对话系统


## 1. Rasa Review

*   **使用Rasa框架进行二次开发,完成任务型的对话系统搭建。**

* （1）进入[rasa官网](https://rasa.com/)了解rasa的详情；

* （2）了解rasa基础模型文件：[Rasa-nlu](https://github.com/RasaHQ/rasa_nlu) 和 [Rasa-core](https://github.com/RasaHQ/rasa_core)

* （3）Rasa的安装：在Linux或Mac OS中安装较为方便，而Windows安装需要进行编译，较为繁杂。
  
  ```
  pip install rasa_core==0.9.8
  pip install -U scikit-learn sklearn-crfsuite
  pip install git+https://github.com/mit-nlp/MITIE.git
  pip install jieba
  ```
* （4）Rasa的对话流程pipeline：
  
  ```yaml
  language: "zh"

  pipeline:
    - name: "nlp_mitie"  # 命名实体识别，词向量训练
      model: "data/total_word_feature_extractor.dat"  # 加载通过mitie预训练的词向量模型
    - name: "tokenizer_jieba"  # 结巴分词
      dictionary_path: "nlu_data/jieba_dictionary.txt"  # jieba自定义词典
    - name: "ner_mitie"  # 实体识别
    - name: "ner_synonyms"  # 同义词替换
    - name: "intent_entity_featurizer_regex"  # 额外的正则特征
    - name: "intent_featurizer_mitie"  # 意图特征提取（通过词向量，把每个词的词向量相加后取平均，作为句子特征的表示，作为sk-learn的输入）
    - name: "intent_classifier_sklearn"  # 意图识别分类器
  ```


## 2. 项目搭建

* 2.1 项目目录
  
  ```
  vip-chatbot
    |——consolution
        |——answer  # 问答库相关映射文件
        |    |——qa.json  # 正常问答时，action到答案的映射文件
        |    |——qa_by_entity.json  # 单轮Fallback时，实体与相关问题和答案的映射文件
        |    |——qa_by_intent.json  # 单轮Fallback时，意图与相关问题和答案的映射文件
        |——core_data  
        |    |——domain.yml  # 定义意图，实体，槽，action，模板
        |    |——story.md  # 意图与action的故事脚本
        |——models  # 训练后保存的模型
        |    |——nlu  # 训练好的rasa-nlu意图分类模型
        |    |——dialogue  # 训练好的rasa-core模型
        |——nlu_data
             |——chatito  # 定义句子模板，用于生成rasa-nlu格式的训练数据
             |——train_data  # 生成后的rasa-nlu意图分类器训练数据
                |——rasa_dataset_training.json  # chatito生成的json格式的样本，定义了同义词
                |——regex.json  # 定义的正则，用于额外的正则特征提取
        static  # 网页版的咨询机器人
        bot.py  # rasa-nlu和rasa-core训练与rasa对话系统运行接口
        myregex_entity_extrator.py  # 自定义的实体提取类
        pipeline_config.yml  # rasa-nlu的流水线定义文件
        webchat.py  # 网页版机器人启动的python脚本
        vip_action.py  # 执行所有的action，找到最佳答案

  ```

* 2.2 Rasa-nlu训练数据准备
  
  > * （1）确定意图：如办卡方式（banka_fangshi）、查询业务（chaxun_work）、使用范围（use_fanwei）
  > * （2）准备训练数据规则：参考`vip-vhatbot/consolution/nlu_data/chatito`中的格式书写规则文件。该文件由意图句式和同义词词表组成，排列组合从而批量生成rasa格式的训练样本数据。
  > * （3）安装nodejs：进入[Node.js官网](https://nodejs.org/en/)，下载并一路安装，重启终端即可使用npx命令。
  > * （4）生成训练数据：在终端cd到`vip-vhatbot/consolution/nlu_data`目录后，执行`npx chatito chatito --format=rasa`命令，即可在`./nlu_data`中得到rasa的训练数据rasa_dataset_training.json。将该文件放入`vip-vhatbot/consolution/nlu_data/train_data`中。
  > * （5）创建额外正则特征：参考`vip-vhatbot/consolution/nlu_data/train_data/regex.json`中的格式书写正则特征文件，可以使用这些正则特征来增强特征的表示，以用于意图分类。
  > * （6）至此完成训练数据的准备，即可开始训练。

* 2.3 Rasa-core训练数据准备
  
  > * domain.yml：需要定义槽、意图、实体、action和固定的模版返回（用于问候语或多轮）
  >  
    ```yaml
      slots:
        槽名1：
          - type: text
        槽名2：
          - type: text
      intents:
        - 意图名1
        - 意图名2
      entities:
        - 实体名1
        - 实体名2
      templates:
        utter_greet:
          - "Hello"
          - "Hi"
        utter_goodbye:
          - "再见，为您服务很开心^_^"
          - "Bye，下次再见"
      actions:
        - action名1
        - action名2
      ```
  > * story.md：用意图和action构建了会话的训练数据。
  >  
    ```markdown
      ## story greet 故事name，训练用不到，官方文档提示在debug的时候会显示story的名字
      * greet
        - utter_greet

      ## story goodbye
      * goodbye
        - utter_goodbye

      ## story greet goodbye
      * greet
        - utter_greet
      * goodbye
        - utter_goodbye

      ## story inform num
      * inform_num{"num":"1"}  包含的实体
        - Numaction
     ```
  > * vip_action.py：创建预测后的行动到寻找答案的策略文件
  > * myregex_entity_extrator.py：槽实体的正则特征
  > * 至此完成训练数据的准备，即可开始训练。

* 2.4 问答库文件准备：
  
  > * qa.json：将意图与其答案对应起来。 
    ```json
      # 1. action与答案直接对应（以办理方式为例）
      "Bankafangshi":"提供个人身份证原件和电话号码等信息，即可在官网办理会员卡。"
      # 2. action的不同实体与答案一一对应（以查询业务为例）
      "Chaxunwork":{
        "订单":"在XX卡小程序上点击办卡进度即可查看订单。",
        "余额":"在微信公众号，选“其他-个人中心-我的会员卡”-绑定你的会员卡后首页点击会员卡—“账单查询”按钮，进入账单查询界面即可查询余额。
      }
    ```
  > * qa_by_entity.json、qa_by_intent.json：当意图置信度低于阈值时，触发fallback问答，将准备好的问题回复给用户，由用户选择并给予答复，是弥补意图不全或分类不足的方法之一。优先考虑实体相关，其次是意图相关。（这两个文件需要在设计完意图和实体后做）

## 3.模型训练
* Rasa-nlu训练意图分类模型：

  ```python
    def train_nlu():
        from rasa_nlu.training_data.loading import load_data # 新api,会将目录下的所有文件合并
        from rasa_nlu.config import RasaNLUModelConfig #新 API
        from rasa_nlu.model import Trainer
        from rasa_nlu.config import load

        training_data = load_data("nlu_data/train_data")
        trainer = Trainer(load("pipeline_config.yaml")) # load的返回值就是一个RasaNLUModelConfig对象，而且其初始化需要传入的不是文件名，而是读取的配置文件内容，一个字典
        trainer.train(training_data)
        model_directory = trainer.persist("models/", project_name="nlu",fixed_nmodel_name="model_ner_reg_all") # 意图分类模型保存路径

        return model_directory

  ```
* Rasa-core训练action预测分类模型：
  ```python
    def train_dialogue(domain_file="core_data/domain.yml",
                 model_path="models/core/dialogue",
                 training_data_file="core_data/story.md",
                 max_history=3):
        from rasa_core.policies.fallback import FallbackPolicy
        # agent = Agent(domain_file,
        #               policies=[MemoizationPolicy(max_history=2), MobilePolicy()])
        agent = Agent(domain_file, policies=[
            KerasPolicy(MaxHistoryTrackerFeaturizer(BinarySingleStateFeaturizer(),max_history=max_history)),
            FallbackPolicy(fallback_action_name='action_default_fallback',
                           core_threshold=0.3,
                           nlu_threshold=0.3)])
        #如果给的是data的地址，会自动调用load_data
        agent.train(
            training_data_file,
            epochs=200,
            batch_size=16,
            augmentation_factor=50,
            validation_split=0.2
        )

        agent.persist(model_path)
        return agent

    ```
* Demo运行：
    ```
    $ python webchat.py
    ```

## 4.意图分类训练过程详解

- 4.1 训练总控及数据处理：`rasa_nlu/model.py`

```python
    def train(self, data, **kwargs):
        # type: (TrainingData) -> Interpreter
        """Trains the underlying pipeline using the provided training data."""
        # 获取训练数据
        self.training_data = data
        # kwargs就是当你传入key=value时存储的字典
        context = kwargs  # type: Dict[Text, Any]
        #遍历检查组件是否缺失
        for component in self.pipeline:
            updates = component.provide_context()
            if updates:
                context.update(updates)

        # Before the training starts: check that all arguments are provided
        if not self.skip_validation:
            components.validate_arguments(self.pipeline, context)

        # data gets modified internally during the training - hence the copy
        working_data = copy.deepcopy(data)
        # 开始每个组件的训练
        for i, component in enumerate(self.pipeline):
            logger.info("Starting to train component {}"
                        "".format(component.name))
            component.prepare_partial_processing(self.pipeline[:i], context)
            updates = component.train(working_data, self.config,
                                      **context)
            logger.info("Finished training component.")
            if updates:
                context.update(updates)

        return Interpreter(self.pipeline, context)

    # 加载mitie用于训练所有词向量的特征，还有维基百科中文的词向量文件：nlu_data/total_word_feature_extractor.dat
    def provide_context(self):
        type: () -> Dict[Text, Any]
        return {"mitie_feature_extractor": self.extractor,
                "mitie_file": self.component_config.get("model")
```

- 4.2 自定义训练的流程组件

```yaml
language: "zh"

pipeline:
- name: "nlp_mitie" # 初始化MITIE
  model: "nlu_data/yue_total_word_feature_extractor.dat"
- name: "tokenizer_jieba"
  dictionary_path: "nlu_data/jieba_dictionary.txt"
- name: "ner_mitie"
- name: "myregex_entity_extractor.MyRegeexEntityExtractor"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"

```

- 4.3 ner命名实体识别训练组件，得到最优的惩罚系数C：`rasa_nlu/extractors/mitie_entity_extractor.py`

```python
  def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig) -> None
        import mitie
        # 加载预训练好的维基百科词向量文件
        model_file = kwargs.get("mitie_file")
        if not model_file:
            raise Exception("Can not run MITIE entity extractor without a "
                            "language model. Make sure this component is "
                            "preceeded by the 'nlp_mitie' component.")
        # 初始化词向量的训练器
        trainer = mitie.ner_trainer(model_file)
        # 线程数为1
        trainer.num_threads = kwargs.get("num_threads", 1)
        found_one_entity = False

        # filter out pre-trained entity examples
        # 遍历加载训练数据中实体实例
        filtered_entity_examples = self.filter_trainable_entities(
                training_data.training_examples)

        for example in filtered_entity_examples:
            sample = self._prepare_mitie_sample(example)

            found_one_entity = sample.num_entities > 0 or found_one_entity
            trainer.add(sample)

        # Mitie will fail to train if there is not a single entity tagged
        if found_one_entity:
            self.ner = trainer.train()

    # 准备实体训练所需要的数据，并返回分词在文本中的位置信息
    def filter_trainable_entities(self, entity_examples):
        # type: (List[Message]) -> List[Message]
        """Filters out untrainable entity annotations.

        Creates a copy of entity_examples in which entities that have
        `extractor` set to something other than self.name (e.g. 'ner_crf')
        are removed."""
        # 储存所有的训练数据的实体内容信息（实体，意图）及其位置信息（始止）
        filtered = []
        # 遍历json文件中的每个训练数据
        for message in entity_examples:
            entities = []
            # 获取每条训练数据中的所有实体信息
            for ent in message.get("entities", []):
                extractor = ent.get("extractor")
                if not extractor or extractor == self.name:
                    entities.append(ent)
            # 更新实体信息
            data = message.data.copy()
            data['entities'] = entities
            # 如语料‘我要上海明天的天气’中的实体（地点，日期）信息：{'intent': 'weather_address_date-time', 'entities': [{'start': 2, 'end': 4, 'value': '上海', 'entity': 'address'}, {'start': 4, 'end': 6, 'value': '明天', 'entity': 'date-time'}]
            filtered.append(
                Message(text=message.text,
                        data=data,
                        output_properties=message.output_properties,
                        time=message.time))

        return filtered

    def _prepare_mitie_sample(self, training_example):
        import mitie
        # 获取训练数据：‘我要上海明天的天气’
        text = training_example.text
        # 分词后的list：['我要','上海','明天','的','天气']
        tokens = training_example.get("tokens")
        sample = mitie.ner_training_instance([t.text for t in tokens])
        # 遍历语料中的实体，地点和时间：{'start': 2, 'end': 4, 'value': '上海', 'entity': 'address'}, {'start': 4, 'end': 6, 'value': '明天', 'entity': 'date-time'}]
        for ent in training_example.get("entities", []):
            try:
                # if the token is not aligned an exception will be raised
                start, end = MitieEntityExtractor.find_entity(
                        ent, text, tokens)
            except ValueError as e:
                logger.warning("Example skipped: {}".format(str(e)))
                continue
            try:
                # mitie will raise an exception on malicious
                # input - e.g. on overlapping entities
                sample.add_entity(list(range(start, end)), ent["entity"])
            except Exception as e:
                logger.warning("Failed to add entity example "
                               "'{}' of sentence '{}'. Reason: "
                               "{}".format(str(e), str(text), e))
                continue
        return sample

    def train(self):
        if self.size == 0:
            raise Exception("You can't call train() on an empty trainer.")
        # Make the type be a c_void_p so the named_entity_extractor constructor will know what to do.
        # 获取最优C参数的训练
        obj = ctypes.c_void_p(_f.mitie_train_named_entity_extractor(self.__obj))
        if obj is None:
            raise Exception("Unable to create named_entity_extractor.  Probably ran out of RAM")
        return named_entity_extractor(obj)

```

- 4.4 同义词替换训练组件：`rasa_nlu/extractors/entity_synonyms.py`

```python
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData) -> None
        # 获取json数据中的同义词信息，加入到self的synonyms参数当中来
        for key, value in list(training_data.entity_synonyms.items()):
            self.add_entities_if_synonyms(key, value)
        # 将实体词加入到self的entity参数当中来
        for example in training_data.entity_examples:
            for entity in example.get("entities", []):
                entity_val = example.text[entity["start"]:entity["end"]]
                self.add_entities_if_synonyms(entity_val,
                                              str(entity.get("value")))
```

- 4.5 自定义正则特征加强组件：`rasa_nlu/featurizers/regex_featurizer.py`

```python
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None

        # 加载自定义的正则特征：regex.json
        for example in training_data.regex_features:
            self.known_patterns.append(example)

        for example in training_data.training_examples:
            updated = self._text_features_with_regex(example)
            example.set("text_features", updated)
```

- 4.6 实体特征向量化组件：`rasa_nlu/featurizers/mitie_featurizer.py`

```python
    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None

        mitie_feature_extractor = self._mitie_feature_extractor(**kwargs)
        for example in training_data.intent_examples:
            # 构建向量化特征
            features = self.features_for_tokens(example.get("tokens"),
                                                mitie_feature_extractor)
            example.set("text_features",
                        self._combine_with_existing_text_features(
                                example, features))

```

- 4.7 意图识别分类器训练组件：在`rasa_nlu/classifiers/sklearn_intent_classifier.py`

```python
    def train(self, training_data, cfg, **kwargs):
        # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
        """Train the intent classifier on a data set."""
        # 定义线程数，可否增加，会对训练有什么影响？
        num_threads = kwargs.get("num_threads", 1)
        # 获取训练数据中的意图标签
        labels = [e.get("intent")
                  for e in training_data.intent_examples]
        # 意图标签需要至少两类，否则发出警告
        if len(set(labels)) < 2:
            logger.warn("Can not train an intent classifier. "
                        "Need at least 2 different classes. "
                        "Skipping training of intent classifier.")
        else:
            # 将字符串标签用num来表示
            y = self.transform_labels_str2num(labels)

            # 获取one-hot编码的训练数据
            X = np.stack([example.get("text_features")
                          for example in training_data.intent_examples])
            # 创建训练器
            self.clf = self._create_classifier(num_threads, y)
            # 开始训练
            self.clf.fit(X, y)

    def _create_classifier(self, num_threads, y):
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import SVC
        # 获取参数调节列表，暂定为[1,2,5,10,20,100]
        C = self.component_config["C"]
        # 使用的是线性核：linear
        kernels = self.component_config["kernels"]
        # dirty str fix because sklearn is expecting
        # str not instance of basestr...
        tuned_parameters = [{"C": C,
                             "kernel": [str(k) for k in kernels]}]

        # aim for 5 examples in each fold
        # 每个fold应该要有5个样例
        cv_splits = self._num_cv_splits(y)
        # 返回网格搜索的训练器
        return GridSearchCV(SVC(C=1,
                                probability=True,
                                class_weight='balanced'),
                            param_grid=tuned_parameters,
                            n_jobs=num_threads,
                            cv=cv_splits,
                            scoring='f1_weighted',
                            verbose=1)

    def _num_cv_splits(self, y):
        folds = self.component_config["max_cross_validation_folds"]
        return max(2, min(folds, np.min(np.bincount(y)) // 5))

```