# week12
**Repository Path**: yaotengdong/week12
## Basic Information
- **Project Name**: week12
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-07-21
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## 1. Use the Wide and Deep model for CTR prediction on the Kaggle competition data provided by Criteo.
## 2. Data description
The dataset covers 11 days: 10 days of training data (train) and 1 day of test data (test).
File description:
train.csv: training data.
eval.csv: evaluation data.
Field description (all fields have been anonymized):
I1-I13: integer (numerical) features
C1-C26: categorical features, hash-encoded
clicked: whether the ad was clicked (the label)
## 3. Evaluation metric
The project uses logloss to measure model performance. Let $y_i$ be the true label of the $i$-th sample, $\hat{y}_i$ the model's predicted probability for the $i$-th sample, and $N$ the number of samples. Then

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\,\right]$$
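A minimal NumPy sketch of this metric (the `eps` clipping is a standard numerical safeguard, not part of the formula above):

```python
import numpy as np

def logloss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy averaged over N samples."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # guard against log(0)
    y_true = np.asarray(y_true, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Two samples, each predicted with probability 0.9 for its true class:
print(logloss([1, 0], [0.9, 0.1]))  # ≈ 0.1054 (= -log 0.9)
```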
Additional references:
TensorFlow Wide and Deep model explained, with applications: https://cloud.tencent.com/developer/article/1143316
Chinese translation: https://www.helplib.com/GitHub/article_153816
Wide and Deep classification example: https://github.com/tensorflow/models/tree/master/official/wide_deep
https://github.com/gutouyu/ML_CIA/tree/master/Wide&Deep
DeepCTR, a toolkit of multiple deep models for CTR: https://github.com/shenweichen/DeepCTR
Chinese introduction: https://zhuanlan.zhihu.com/p/53231955
## Implementation
### 1. Inspect the raw data. The CSV has no header row (no column labels), so assign column names according to the data description: numerical columns are I1 - I13, categorical columns are C1 - C26.
```
CONTINUOUS_COLUMNS = ["I"+str(i) for i in range(1,14)] # 1-13 inclusive
CATEGORICAL_COLUMNS = ["C"+str(i) for i in range(1,27)] # 1-26 inclusive
LABEL_COLUMN = ["clicked"]
TRAIN_DATA_COLUMNS = LABEL_COLUMN + CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
# TEST_DATA_COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
FEATURE_COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
```
### 2. Create feature columns
#### Create categorical feature columns
```
import tensorflow as tf  # TF 1.x API; tf.contrib was removed in TF 2.x

wide_columns = []
for name in CATEGORICAL_COLUMNS:
wide_columns.append(tf.contrib.layers.sparse_column_with_hash_bucket(
name, hash_bucket_size=1000))
```
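Conceptually, `sparse_column_with_hash_bucket` assigns each raw string a stable integer ID in `[0, hash_bucket_size)`. A pure-Python sketch of the idea (TensorFlow uses its own fingerprint hash, so the actual bucket IDs differ):

```python
import hashlib

def hash_bucket(value, hash_bucket_size=1000):
    """Map a raw categorical string to a stable bucket ID in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# The same value always maps to the same bucket; distinct values may collide.
b = hash_bucket("68fd1e64")  # an example hash-encoded category value
print(0 <= b < 1000, b == hash_bucket("68fd1e64"))  # True True
```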
#### Create numerical feature columns
```
deep_columns = []
for name in CONTINUOUS_COLUMNS:
deep_columns.append(tf.contrib.layers.real_valued_column(name))
```
#### Create embedding feature columns (embed each categorical column into a dense vector so it can also feed the deep network)
```
for col in wide_columns:
deep_columns.append(tf.contrib.layers.embedding_column(col,
dimension=8))
```
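`embedding_column` gives each hash bucket a learned dense vector of length `dimension=8`; conceptually it is a lookup table (initialized randomly here for illustration — in practice the table is learned during training):

```python
import numpy as np

hash_bucket_size, dimension = 1000, 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(hash_bucket_size, dimension))  # learned in training

bucket_id = 42                        # output of the hash bucketing step
vector = embedding_table[bucket_id]   # dense 8-dim representation fed to the DNN
print(vector.shape)  # (8,)
```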
### 3. Build the models
Three models are built: 1) a linear classifier over the categorical (wide) columns; 2) a deep neural network classifier over the numerical and embedding (deep) columns; 3) a combined linear-and-deep classifier that uses both.
```
# Linear Classifier
if model_type == 'WIDE':
m = tf.contrib.learn.LinearClassifier(
model_dir=model_dir,
feature_columns=wide_columns)
# Deep Neural Net Classifier
if model_type == 'DEEP':
m = tf.contrib.learn.DNNClassifier(
model_dir=model_dir,
feature_columns=deep_columns,
hidden_units=[100, 50, 25])
# Combined Linear and Deep Classifier
if model_type == 'WIDE_AND_DEEP':
m = tf.contrib.learn.DNNLinearCombinedClassifier(
model_dir=model_dir,
linear_feature_columns=wide_columns,
dnn_feature_columns=deep_columns,
dnn_hidden_units=[100, 70, 50, 25],
config=runconfig)
```
### 4. Train the constructed models on the data:
```
%%time
# This can be found with
# wc -l train.csv
train_sample_size = 800000
train_steps = train_sample_size // BATCH_SIZE  # steps = samples / batch size (integer)
model_w_n_d.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)
model_w.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)
model_d.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)
print('fit done')
```
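`generate_input_fn` is not reproduced here; for `tf.contrib.learn` it must return a `(features, label)` pair, with dense tensors for I1-I13 and sparse tensors for C1-C26. A hedged pandas sketch of the column handling it would need (the real version wraps the arrays in `tf.constant` / `tf.SparseTensor`):

```python
import io
import pandas as pd

CONTINUOUS_COLUMNS = ["I" + str(i) for i in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(i) for i in range(1, 27)]
COLUMNS = ["clicked"] + CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS

def load_features(csv_file):
    """Read a headerless Criteo-style CSV into a (features, labels) pair."""
    df = pd.read_csv(csv_file, names=COLUMNS)
    df[CONTINUOUS_COLUMNS] = df[CONTINUOUS_COLUMNS].fillna(0)     # numeric gaps -> 0
    df[CATEGORICAL_COLUMNS] = df[CATEGORICAL_COLUMNS].fillna("")  # missing hashes -> ""
    features = {name: df[name].values for name in COLUMNS[1:]}
    return features, df["clicked"].values

# Tiny two-row stand-in for train.csv: label, 13 integers, 26 hash strings.
demo = io.StringIO("\n".join(
    ",".join(["1"] + ["0"] * 13 + ["aabbcc"] * 26) for _ in range(2)))
features, labels = load_features(demo)
print(len(features), labels.tolist())  # 39 [1, 1]
```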
### 5. Evaluate on the eval data:
1) Wide and Deep model
```
%%time
eval_sample_size = 200000 # this can be found with a 'wc -l eval.csv'
eval_steps = eval_sample_size // BATCH_SIZE  # steps = samples / batch size (integer)
results = model_w_n_d.evaluate(input_fn=generate_input_fn(eval_file),
steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```
Result: 'loss': 0.4962089, 'accuracy': 0.768985
2) Wide model
```
results = model_w.evaluate(input_fn=generate_input_fn(eval_file),
steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```
Result: 'loss': 0.50244325, 'accuracy': 0.766125
3) Deep model
```
results = model_d.evaluate(input_fn=generate_input_fn(eval_file),
steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```
Result: 'loss': 0.5411806, 'accuracy': 0.75249
Taken together, the Wide and Deep model (loss 0.4962089) outperforms either single model (wide loss 0.50244325, deep loss 0.5411806), although the margin over the wide model alone is fairly small.
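The size of the gap can be quantified as a relative logloss reduction (numbers copied from the evaluation results above):

```python
wide_loss, deep_loss, combined_loss = 0.50244325, 0.5411806, 0.4962089

for name, loss in [("wide", wide_loss), ("deep", deep_loss)]:
    reduction = (loss - combined_loss) / loss * 100
    print(f"wide-and-deep vs {name}: {reduction:.1f}% lower logloss")
# wide-and-deep vs wide: 1.2% lower logloss
# wide-and-deep vs deep: 8.3% lower logloss
```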