# cleanlab_nlp_keras

**Repository Path**: qq874455953/cleanlab_nlp_keras

## Basic Information

- **Project Name**: cleanlab_nlp_keras
- **Description**: No description available
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 5
- **Forks**: 3
- **Created**: 2021-12-23
- **Last Updated**: 2024-05-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README



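# Dataset

https://github.com/fate233/toutiao-multilevel-text-classfication-dataset

Toutiao news headlines: a multi-label classification corpus with about 3 million samples and 1,000+ categories.



Why this dataset:

The data was collected by crawling, and Toutiao presumably assigns its categories with a model of its own, so the labels themselves are noisy and make a good target for denoising. Removing the noisy samples also keeps our model from simply imitating Toutiao's own classifier.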







# Technical choices

Baseline model

BERT encoder + CNN + softmax classifier



# Experiments

Environment

TensorFlow 1.15 + Keras 2.3.1 + bert4keras 0.88

GPU: 3090 (CUDA 11.3)

cleanlab==1.0.1

`pip install cleanlab==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple`

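## EDA

Number of valid records in the original dataset (records that have both news keywords and a news label):

2,906,829. The count falls short of 2.91 million because several thousand records lack keywords or a news label.



A sample may carry multiple categories, separated by **”，“**.

When a sample has multiple categories, I keep only the first one

and convert it into a level-1 category and a level-2 category.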
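
The label handling described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `first_category` is hypothetical, and the `/` separator between the level-1 and level-2 parts is an assumption about the label format that should be checked against the raw data.

```python
def first_category(raw_label, sep="，", level_sep="/"):
    """Keep the first category of a label string and split it into two levels.

    `sep` is the full-width comma that separates multiple categories.
    `level_sep` is an ASSUMED separator between level 1 and level 2.
    """
    # Keep only the first of the comma-separated categories.
    first = raw_label.split(sep)[0].strip()
    # Split once into level-1 and level-2 parts; level 2 may be missing.
    level1, _, level2 = first.partition(level_sep)
    return level1, (level2 or None)


# Hypothetical label strings, for illustration only.
print(first_category("news_tech/internet，news_finance"))
print(first_category("news_car"))
```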





## Dataset generation

There are 428 level-2 categories in total.

![image-20211117115940283](./img/image-20211117115940283.png)

Reservoir sampling is used to draw the samples for each level-2 category:

![image-20211117145404419](./img/image-20211117145404419.png)
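
For reference, per-category reservoir sampling can be sketched as below. This is a generic version of the standard algorithm (Algorithm R), not the repository's implementation; the function name, the `(category, sample)` record layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def reservoir_sample_by_category(records, n_per_category, seed=0):
    """Draw up to n_per_category samples uniformly from each category in one pass.

    `records` is any iterable of (category, sample) pairs (assumed layout).
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # category -> current reservoir
    seen = defaultdict(int)         # category -> number of samples seen so far

    for category, sample in records:
        seen[category] += 1
        bucket = reservoirs[category]
        if len(bucket) < n_per_category:
            # Fill the reservoir with the first n samples of the category.
            bucket.append(sample)
        else:
            # Keep the new sample with probability n / seen[category].
            j = rng.randrange(seen[category])
            if j < n_per_category:
                bucket[j] = sample
    return dict(reservoirs)


records = [("a", i) for i in range(100)] + [("b", i) for i in range(3)]
sampled = reservoir_sample_by_category(records, n_per_category=5)
print({category: len(items) for category, items in sampled.items()})
```

Because the data is read only once and each reservoir has a fixed size, memory use stays bounded even for a corpus of several million records.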

## Noise sampling statistics

```todo```



## Baseline model

The baseline model is

**CLS token of ALBERT's first encoder layer + Linear**



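```python
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr
from keras.layers import Dense, Lambda
from keras.models import Model

# config_path, checkpoint_path and num_classes are expected to be defined elsewhere.


def build_bert_cls():
    # Load the pretrained model
    bert = build_transformer_model(
        config_path=config_path,
        checkpoint_path=checkpoint_path,
        model='albert',
        return_keras_model=False,
    )

    # Expose an intermediate output of the shared ALBERT layers
    model = Model(
        inputs=bert.model.input,
        outputs=[
            bert.model.layers[-4].get_output_at(1),
        ]
    )
    output = model.output
    # Take the [CLS] vector (position 0) of the first encoder layer's output
    output = Lambda(lambda x: x[:, 0], name='CLS-token')(output)

    # Linear classification head with softmax
    output = Dense(
        units=num_classes,
        activation='softmax',
        kernel_initializer=bert.initializer
    )(output)

    model = keras.models.Model(bert.model.input, output)
    model.summary()

    AdamLR = extend_with_piecewise_linear_lr(Adam, name='AdamLR')

    model.compile(
        loss='sparse_categorical_crossentropy',
        # optimizer=Adam(1e-5),  # use a sufficiently small learning rate
        optimizer=AdamLR(learning_rate=1e-4, lr_schedule={
            1000: 1,
            2000: 0.1
        }),
        metrics=['accuracy'],
    )

    return model
```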



## Confident-learning denoising

### K-fold cross-validation

Split the full dataset into K folds:

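```python
# Fold size; when len(total_data) is not divisible by k, the trailing
# remainder is always used for training and never held out.
step = len(total_data) // k
for i in range(k):
    # Fold i is held out for prediction; the model is trained on the rest.
    test_data = total_data[i * step:(i + 1) * step]
    train_data = total_data[:i * step] + total_data[(i + 1) * step:]
```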
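
The slicing above leaves out the last `len(total_data) % k` samples from every held-out fold and relies on the data already being in random order. A variant that shuffles once and covers every sample exactly once could look like the sketch below. It is an illustration only: the helper name `kfold_indices` and the fixed seed are assumptions, and it requires `k >= 2`.

```python
import numpy as np


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is held out exactly once."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)  # shuffle once up front
    folds = np.array_split(order, k)    # fold sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]
        # All other folds together form the training split.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


for train_idx, test_idx in kfold_indices(n_samples=10, k=3):
    print(len(train_idx), len(test_idx))
```

Working with index arrays rather than list slices also makes it easy to map the noise indices found in each fold back to positions in the full dataset.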

### Computing the probabilities

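```python
import numpy as np

# Predicted class probabilities for the held-out fold, shape (n_samples, num_classes)
numpy_array_of_predicted_probabilities = model.predict(test_generator.fortest(), steps=len(test_generator))

# Predicted label for each held-out sample
y_pred = np.argmax(numpy_array_of_predicted_probabilities, axis=-1)
```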

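```python
from cleanlab.pruning import get_noise_indices

# test_label: given (possibly noisy) labels of the held-out fold, as a NumPy int array.
# With sorted_index_method set, the result is an array of sample indices
# flagged as likely label errors, ordered by the chosen score.
ordered_label_errors = get_noise_indices(
    s=test_label,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)
```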
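
To make the `normalized_margin` ordering more concrete, the sketch below ranks a set of flagged samples by the probability of the given label minus the largest predicted probability, so the most suspicious samples come first. This is an illustrative reading of the idea, not cleanlab's own code; the function name and the toy probabilities are assumptions.

```python
import numpy as np


def rank_by_normalized_margin(psx, s, candidate_idx):
    """Order candidate label errors from most to least suspicious.

    psx: (n_samples, num_classes) predicted probabilities
    s: (n_samples,) given labels
    candidate_idx: indices already flagged as likely label errors
    """
    psx = np.asarray(psx)
    s = np.asarray(s)
    candidate_idx = np.asarray(candidate_idx)
    # Probability the model assigns to the label each sample was given.
    self_confidence = psx[candidate_idx, s[candidate_idx]]
    # Margin to the most confident prediction; more negative = more suspicious.
    margin = self_confidence - psx[candidate_idx].max(axis=1)
    return candidate_idx[np.argsort(margin)]


toy_psx = [[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]]
toy_labels = [0, 0, 0]
print(rank_by_normalized_margin(toy_psx, toy_labels, [2, 1]))
```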

### Selecting the samples to remove

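```python
# cleaned_data and noise_data are lists that accumulate results across folds.
noise_indices = set(int(j) for j in ordered_label_errors)  # O(1) membership checks
for i, sample in enumerate(test_data):
    if i in noise_indices:
        noise_data.append(sample)    # flagged as a likely label error
    else:
        cleaned_data.append(sample)  # kept in the denoised dataset
```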

### Saving the denoised samples

TODO



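# Results

**TODO**: So far, denoising has only a modest effect on training, possibly because the dataset contains little noise, so the next step is to inject noise artificially.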
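
One simple way to create such noise is to flip a fixed fraction of labels to a different, uniformly chosen class. The sketch below is an assumption about how this could be done, not a description of what the repository does; the function name, the uniform flipping scheme, and the noise rate are all hypothetical.

```python
import numpy as np


def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Flip `noise_rate` of the labels to a different class chosen uniformly.

    Returns the noisy labels and the sorted indices that were flipped.
    Requires num_classes >= 2.
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.randint(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, np.sort(flip_idx)


clean = np.zeros(100, dtype=int)
noisy, flipped = inject_label_noise(clean, num_classes=4, noise_rate=0.2)
print(len(flipped), int((noisy != clean).sum()))
```

Since the flipped indices are known, they can be compared with the indices returned by `get_noise_indices` to estimate how many injected errors the denoising step recovers.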

## Comparison before and after denoising



## Proportion of noise removed



## Training results on the dataset





## Hierarchical classification