# week12

**Repository Path**: yaotengdong/week12

## Basic Information

- **Project Name**: week12
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-07-21
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## 1、Use the Wide and Deep model for CTR prediction on the Criteo data from the Kaggle competition.

## 2、Data description:

The data covers 11 days in total: 10 days are training data (train) and 1 day is test data (test).
File description:
- train.csv: training data.
- eval.csv: evaluation data.

Field description (all fields are anonymized):
- I1-I13: integer features.
- C1-C26: categorical features, already hash-encoded.
- clicked: whether the sample was clicked (the label).
## 3、Evaluation metric:

The project uses logloss to measure model performance. Let y_i be the true label of the i-th sample, ŷ_i the model's predicted probability for the i-th sample, and N the number of samples. Then:

logloss = -(1/N) · Σ_{i=1}^{N} [ y_i · log(ŷ_i) + (1 - y_i) · log(1 - ŷ_i) ]
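The metric above can be computed directly; a minimal sketch in plain Python (the clipping constant `eps` is an implementation convenience to keep the logarithm finite, not part of the formula):

```
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy over all samples.

    Predictions are clipped into [eps, 1 - eps] so that log() never
    sees 0 or 1, which would produce -inf.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

A confident correct prediction contributes little to the loss, while a confident wrong one is penalized heavily, which is why logloss is preferred over plain accuracy for CTR probability estimates.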
Other references:
- TensorFlow Wide and Deep model, explanation and application: https://cloud.tencent.com/developer/article/1143316
- Chinese version: https://www.helplib.com/GitHub/article_153816
- Wide and Deep classification example: https://github.com/tensorflow/models/tree/master/official/wide_deep
- https://github.com/gutouyu/ML_CIA/tree/master/Wide&Deep
- DeepCTR, a toolkit containing many deep models for CTR prediction: https://github.com/shenweichen/DeepCTR
- Chinese introduction: https://zhuanlan.zhihu.com/p/53231955
## Implementation:

### 1. Inspect the raw data. The test data carries no label, so the columns are split according to the description: numerical data is I1-I13, categorical data is C1-C26.

```
CONTINUOUS_COLUMNS = ["I" + str(i) for i in range(1, 14)]   # I1-I13 inclusive
CATEGORICAL_COLUMNS = ["C" + str(i) for i in range(1, 27)]  # C1-C26 inclusive
LABEL_COLUMN = ["clicked"]

TRAIN_DATA_COLUMNS = LABEL_COLUMN + CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
# TEST_DATA_COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
FEATURE_COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS
```

### 2. Build the feature columns

#### Categorical feature columns (the wide part)

```
wide_columns = []
for name in CATEGORICAL_COLUMNS:
    wide_columns.append(tf.contrib.layers.sparse_column_with_hash_bucket(
        name, hash_bucket_size=1000))
```

#### Numerical feature columns (the deep part)

```
deep_columns = []
for name in CONTINUOUS_COLUMNS:
    deep_columns.append(tf.contrib.layers.real_valued_column(name))
```

#### Combined feature columns: embed each categorical column and add it to the deep columns

```
for col in wide_columns:
    deep_columns.append(tf.contrib.layers.embedding_column(col, dimension=8))
```

### 3. Build the models

Three kinds of model are created: 1) a linear classifier over the categorical columns; 2) a deep neural network classifier over the numerical and embedded columns; 3) a combined linear-and-deep classifier over both.
```
# Linear Classifier
if model_type == 'WIDE':
    m = tf.contrib.learn.LinearClassifier(
        model_dir=model_dir, feature_columns=wide_columns)

# Deep Neural Net Classifier
if model_type == 'DEEP':
    m = tf.contrib.learn.DNNClassifier(
        model_dir=model_dir,
        feature_columns=deep_columns,
        hidden_units=[100, 50, 25])

# Combined Linear and Deep Classifier
if model_type == 'WIDE_AND_DEEP':
    m = tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 70, 50, 25],
        config=runconfig)
```

### 4. Train the models on the training data:

```
%%time
# The line count can be found with: wc -l train.csv
train_sample_size = 800000
train_steps = train_sample_size // BATCH_SIZE  # steps for one pass over the data

model_w_n_d.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)
model_w.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)
model_d.fit(input_fn=generate_input_fn(train_file, BATCH_SIZE), steps=train_steps)

print('fit done')
```

### 5. Evaluate on the eval data:

1) Wide and Deep model

```
%%time
eval_sample_size = 200000  # the line count can be found with: wc -l eval.csv
eval_steps = eval_sample_size // BATCH_SIZE  # steps for one pass over the data

results = model_w_n_d.evaluate(input_fn=generate_input_fn(eval_file), steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```

Result: 'loss': 0.4962089, 'accuracy': 0.768985
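`generate_input_fn` is not shown in this README. As an illustration of what such a factory typically produces for these estimators — a dict mapping each feature column name to a batch of values, plus a batch of labels — here is a framework-free sketch in plain Python (the function name, CSV handling, and missing-value default are assumptions, not the project's actual code):

```
import csv
import io

CONTINUOUS_COLUMNS = ["I" + str(i) for i in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(i) for i in range(1, 27)]
LABEL_COLUMN = "clicked"

def generate_batches(csv_text, batch_size):
    """Yield (features, labels) batches from CSV text.

    features is a dict: column name -> list of values for the batch,
    mirroring the {column: tensor} dict a TF input_fn would return.
    A trailing partial batch is dropped for simplicity.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features = {c: [] for c in CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS}
    labels = []
    for row in reader:
        for c in CONTINUOUS_COLUMNS:
            features[c].append(float(row[c] or 0))  # empty integer field -> 0
        for c in CATEGORICAL_COLUMNS:
            features[c].append(row[c])              # keep hashed strings as-is
        labels.append(int(row[LABEL_COLUMN]))
        if len(labels) == batch_size:
            yield features, labels
            features = {c: [] for c in features}
            labels = []
```

A real TF input_fn would additionally convert the integer lists to dense tensors and the categorical lists to sparse tensors before handing them to the estimator.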
2) Wide model

```
results = model_w.evaluate(input_fn=generate_input_fn(eval_file), steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```

Result: 'loss': 0.50244325, 'accuracy': 0.766125
3) Deep model

```
results = model_d.evaluate(input_fn=generate_input_fn(eval_file), steps=eval_steps)
print('evaluate done')
print('Accuracy: %s' % results['accuracy'])
print(results)
```

Result: 'loss': 0.5411806, 'accuracy': 0.75249
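The three evaluation losses reported above can be collected and ranked programmatically; a small sketch using those values:

```
# Eval logloss per model, copied from the results above.
results = {
    "wide_and_deep": 0.4962089,
    "wide": 0.50244325,
    "deep": 0.5411806,
}

# Lower logloss is better, so pick the model with the minimum value.
best = min(results, key=results.get)
```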
Comparing the results, the Wide and Deep model (loss: 0.4962089) outperforms both single models (wide loss: 0.50244325, deep loss: 0.5411806).