Contents

Wide&Deep Description

The Wide&Deep model is a classic model in the recommendation and click-through-rate prediction area. This is an implementation of Wide&Deep as described in the Wide & Deep Learning for Recommender Systems paper.

Model Architecture

The Wide&Deep model jointly trains a wide linear model and a deep neural network, combining the benefits of memorization and generalization for recommender systems.

Currently we support host-device mode with multi-dimensional partition parallelism for the embedding table, as well as parameter server mode. We also implement a cache mode for huge embedding tables, developed in cooperation with Noah's Ark Lab (ScaleFreeCTR).
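As a rough illustration of this joint structure, a minimal Wide&Deep cell might look like the sketch below, assuming the default hyper-parameters (field_size=39, vocab_size=200000, emb_dim=80, deep_layer_dim=[1024,512,256,128]); the actual implementation lives in src/wide_and_deep.py:

```python
# Hedged sketch of the Wide&Deep joint architecture, not the repo's code.
import mindspore.nn as nn
import mindspore.ops as ops

class WideDeepSketch(nn.Cell):
    def __init__(self, field_size=39, vocab_size=200000, emb_dim=80):
        super().__init__()
        # Wide part: one scalar weight per sparse id, i.e. a linear model
        # that memorizes feature co-occurrences.
        self.wide_w = nn.Embedding(vocab_size, 1)
        # Deep part: dense embeddings fed through an MLP for generalization.
        self.deep_emb = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.SequentialCell([
            nn.Dense(field_size * emb_dim, 1024, activation='relu'),
            nn.Dense(1024, 512, activation='relu'),
            nn.Dense(512, 256, activation='relu'),
            nn.Dense(256, 128, activation='relu'),
            nn.Dense(128, 1)])

    def construct(self, ids):
        # ids: (batch, field_size) int32 feature ids
        wide_logit = self.wide_w(ids).sum(axis=1).squeeze(-1)
        deep_in = self.deep_emb(ids).reshape((ids.shape[0], -1))
        deep_logit = self.mlp(deep_in).squeeze(-1)
        # Joint training: the two logits are summed before the sigmoid loss.
        return ops.sigmoid(wide_logit + deep_logit)
```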

Dataset

Environment Requirements

Quick Start

  1. Clone the Code
git clone https://gitee.com/mindspore/models.git
cd models/official/recommend/Wide_and_Deep
  2. Download the Dataset

Please refer to [1] to obtain the download link.

mkdir -p data/origin_data && cd data/origin_data
wget DATA_LINK
tar -zxvf dac.tar.gz
  3. Use this script to preprocess the data. This may take about one hour; the generated MindRecord data is placed under data/mindrecord.
python src/preprocess_data.py  --data_path=./data/ --dense_dim=13 --slot_dim=26 --threshold=100 --train_line_count=45840617 --skip_id_convert=0
  4. Start Training

Once the dataset is ready, the model can be trained and evaluated on a single device (Ascend) with the following command:

python train_and_eval.py --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend

On the Ascend910B platform (Ascend910B1, Ascend910B2, Ascend910B3, Ascend910B4), evaluation while training is not supported. Train the model as follows:

python train.py --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend

To evaluate the model, run the following command:

python eval.py  --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend --ckpt_path=./ckpt/widedeep_train-15_2582.ckpt
  • Running on ModelArts (if you want to run on ModelArts, please check the official ModelArts documentation; you can start training as follows)

    # Train 8p on ModelArts
    # (1) Perform a or b.
    #       a. Set "enable_modelarts=True" on default_config.yaml file.
    #          Set "run_distribute=True" on default_config.yaml file.
    #          Set "data_path=/cache/data/criteo_mindrecord/" on default_config.yaml file.
    #          Set other parameters on default_config.yaml file you need.
    #       b. Add "enable_modelarts=True" on the website UI interface.
    #          Add "run_distribute=True" on the website UI interface.
    #          Add "dataset_path=/cache/data/criteo_mindrecord/" on the website UI interface.
    #          Add other parameters on the website UI interface.
    # (2) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it can be slow.)
    # (3) Set the code directory to "/path/wide_and_deep" on the website UI interface.
    # (4) Set the startup file to "train.py" on the website UI interface.
    # (5) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
    # (6) Create your job.
    #
    # Train 1p on ModelArts
    # (1) Perform a or b.
    #       a. Set "enable_modelarts=True" on default_config.yaml file.
    #          Set "dataset_path='/cache/data/criteo_mindrecord/'" on default_config.yaml file.
    #          Set other parameters on default_config.yaml file you need.
    #       b. Add "enable_modelarts=True" on the website UI interface.
    #          Add "dataset_path=/cache/data/criteo_mindrecord/" on the website UI interface.
    #          Add other parameters on the website UI interface.
    # (2) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it can be slow.)
    # (3) Set the code directory to "/path/wide_and_deep" on the website UI interface.
    # (4) Set the startup file to "train.py" on the website UI interface.
    # (5) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
    # (6) Create your job.
    #
    # Eval 1p on ModelArts
    # (1) Perform a or b.
    #       a. Set "enable_modelarts=True" on default_config.yaml file.
    #          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" on default_config.yaml file.
    #          Set "checkpoint_url='s3://dir_to_trained_ckpt/'" on default_config.yaml file.
    #          Set "dataset_path='/cache/data/criteo_mindrecord/'" on default_config.yaml file.
    #          Set other parameters on default_config.yaml file you need.
    #       b. Add "enable_modelarts=True" on the website UI interface.
    #          Add "ckpt_file=/cache/checkpoint_path/model.ckpt" on the website UI interface.
    #          Add "checkpoint_url=s3://dir_to_trained_ckpt/" on the website UI interface.
    #          Add "dataset_path=/cache/data/criteo_mindrecord/" on the website UI interface.
    #          Add other parameters on the website UI interface.
    # (2) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it can be slow.)
    # (3) Set the code directory to "/path/wide_and_deep" on the website UI interface.
    # (4) Set the startup file to "eval.py" on the website UI interface.
    # (5) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
    # (6) Create your job.
    
  • Export on ModelArts (if you want to run on ModelArts, please check the official ModelArts documentation; you can start exporting as follows)

  1. Export MindIR on ModelArts; the export steps are as follows:

    # (1) Perform a or b.
    #       a. Set "enable_modelarts=True" on base_config.yaml file.
    #          Set "file_name='wide_and_deep'" on base_config.yaml file.
    #          Set "file_format='MINDIR'" on base_config.yaml file.
    #          Set "checkpoint_url='/The path of checkpoint in S3/'" on beta_config.yaml file.
    #          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" on base_config.yaml file.
    #          Set other parameters on base_config.yaml file you need.
    #       b. Add "enable_modelarts=True" on the website UI interface.
    #          Add "file_name='wide_and_deep'" on the website UI interface.
    #          Add "file_format='MINDIR'" on the website UI interface.
    #          Add "checkpoint_url='/The path of checkpoint in S3/'" on the website UI interface.
    #          Add "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the website UI interface.
    #          Add other parameters on the website UI interface.
    # (2) Upload or copy your trained model to S3 bucket.
    # (3) Set the code directory to "/path/wide_and_deep" on the website UI interface.
    # (4) Set the startup file to "export.py" on the website UI interface.
    # (5) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
    # (6) Create your job.
    

Script Description

Script and Sample Code

└── wide_and_deep
    ├── eval.py
    ├── README.md
    ├── script
    │   ├── cluster_32p.json
    │   ├── common.sh
    │   ├── deploy_cluster.sh
    │   ├── run_auto_parallel_train_cluster.sh
    │   ├── run_auto_parallel_train.sh
    │   ├── run_multigpu_train.sh
    │   ├── run_multinpu_train.sh
    │   ├── run_parameter_server_train_cluster.sh
    │   ├── run_parameter_server_train.sh
    │   ├── run_standalone_train_for_gpu.sh
    │   └── start_cluster.sh
    ├── src
    │   ├── callbacks.py
    │   ├── datasets.py
    │   ├── generate_synthetic_data.py
    │   ├── __init__.py
    │   ├── metrics.py
    │   ├── preprocess_data.py
    │   ├── process_data.py
    │   ├── wide_and_deep.py
    │   └── model_utils
    │       ├── config.py                         # Processing configuration parameters
    │       ├── device_adapter.py                 # Get cloud ID
    │       ├── local_adapter.py                  # Get local ID
    │       └── moxing_adapter.py                 # Parameter processing
    ├── default_config.yaml                       # Training parameter profile
    ├── train_and_eval_auto_parallel.py
    ├── train_and_eval_distribute.py
    ├── train_and_eval_parameter_server.py
    ├── train_and_eval.py
    ├── train.py
    └── export.py

Script Parameters

Training Script Parameters

The parameters are the same for train.py, train_and_eval.py, train_and_eval_distribute.py, and train_and_eval_auto_parallel.py.

usage: train.py [-h] [--device_target {Ascend,GPU}] [--data_path DATA_PATH]
                [--epochs EPOCHS] [--full_batch FULL_BATCH]
                [--batch_size BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
                [--field_size FIELD_SIZE] [--vocab_size VOCAB_SIZE]
                [--emb_dim EMB_DIM]
                [--deep_layer_dim DEEP_LAYER_DIM [DEEP_LAYER_DIM ...]]
                [--deep_layer_act DEEP_LAYER_ACT] [--keep_prob KEEP_PROB]
                [--dropout_flag DROPOUT_FLAG] [--output_path OUTPUT_PATH]
                [--ckpt_path CKPT_PATH] [--eval_file_name EVAL_FILE_NAME]
                [--loss_file_name LOSS_FILE_NAME]
                [--host_device_mix HOST_DEVICE_MIX]
                [--dataset_type DATASET_TYPE]
                [--parameter_server PARAMETER_SERVER]

optional arguments:
  --device_target {Ascend,GPU}        Device where the code will run. (Default:Ascend)
  --data_path DATA_PATH               This should be set to the same directory given to the
                                      data_download's data_dir argument
  --epochs EPOCHS                     Total train epochs. (Default:15)
  --full_batch FULL_BATCH             Enable loading the full batch. (Default:False)
  --batch_size BATCH_SIZE             Training batch size.(Default:16000)
  --eval_batch_size                   Eval batch size.(Default:16000)
  --field_size                        The number of features.(Default:39)
  --vocab_size                        The vocabulary size of the dataset.(Default:200000)
  --emb_dim                           The dense embedding dimension of sparse feature.(Default:80)
  --deep_layer_dim                    The dimension of all deep layers.(Default:[1024,512,256,128])
  --deep_layer_act                    The activation function of all deep layers.(Default:'relu')
  --keep_prob                         The keep rate in dropout layer.(Default:1.0)
  --dropout_flag                      Enable dropout.(Default:0)
  --output_path                       Deprecated
  --ckpt_path                         The location of the checkpoint file. If the checkpoint file
                                      is a slice of weight, multiple checkpoint files need to be
                                      transferred. Use ';' to separate them and sort them in sequence
                                      like "./checkpoints/0.ckpt;./checkpoints/1.ckpt".
                                      (Default:./checkpoints/)
  --eval_file_name                    Eval output file.(Default:eval.log)
  --loss_file_name                    Loss output file.(Default:loss.log)
  --host_device_mix                   Enable host device mode or not.(Default:0)
  --dataset_type                      The data type of the training files, chosen from tfrecord/mindrecord/hd5.(Default:tfrecord)
  --parameter_server                  Enable parameter server or not.(Default:0)
  --vocab_cache_size                  Size of the embedding cache; a nonzero value enables cache mode.(Default:0)

Preprocess Script Parameters

usage: generate_synthetic_data.py [-h] [--output_file OUTPUT_FILE]
                                  [--label_dim LABEL_DIM]
                                  [--number_examples NUMBER_EXAMPLES]
                                  [--dense_dim DENSE_DIM]
                                  [--slot_dim SLOT_DIM]
                                  [--vocabulary_size VOCABULARY_SIZE]
                                  [--random_slot_values RANDOM_SLOT_VALUES]
optional arguments:
  --output_file                        The output path of the generated file.(Default: ./train.txt)
  --label_dim                          The label category. (Default:2)
  --number_examples                    The row numbers of the generated file. (Default:4000000)
  --dense_dim                          The number of the continuous features.(Default:13)
  --slot_dim                           The number of the category features.(Default:26)
  --vocabulary_size                    The vocabulary size of the total dataset.(Default:400000000)
  --random_slot_values                 0 or 1. If 1, the ids are generated randomly. If 0, the id is set to row_index mod part_size, where part_size is the vocab size for each slot.
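To make the two modes concrete, the per-slot id assignment can be sketched as below; the per-slot base offset is an illustrative assumption, not necessarily the script's exact logic:

```python
import random

def slot_id(row_index, slot, part_size, random_slot_values):
    # Hedged sketch of the two id-generation modes for one sparse slot.
    base = slot * part_size                        # assumed per-slot id range
    if random_slot_values:
        return base + random.randrange(part_size)  # random id within the slot
    return base + row_index % part_size            # deterministic id
```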
usage: preprocess_data.py [-h]
                          [--data_path DATA_PATH] [--dense_dim DENSE_DIM]
                          [--slot_dim SLOT_DIM] [--threshold THRESHOLD]
                          [--train_line_count TRAIN_LINE_COUNT]
                          [--skip_id_convert {0,1}]

  --data_path                         The path of the data file.
  --dense_dim                         The number of your continuous fields.(default: 13)
  --slot_dim                          The number of your sparse fields, also called categorical features.(default: 26)
  --threshold                         Category values whose frequency is below this value are regarded as OOV. This reduces the vocab size.(default: 100)
  --train_line_count                  The number of examples in your dataset.
  --skip_id_convert                   0 or 1. If set to 1, the code will skip the id conversion and use the original id as the final id.(default: 0)
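The threshold option works roughly as sketched below: category values whose frequency falls below it all share one out-of-vocabulary (OOV) id, shrinking the vocabulary. This is a simplified illustration, not the exact code in src/preprocess_data.py:

```python
# Simplified sketch of the frequency-threshold id conversion; the field
# layout (label, `dense_dim` dense, `slot_dim` sparse, tab-separated)
# follows the Criteo format.
from collections import Counter

def build_vocab(lines, dense_dim=13, slot_dim=26, threshold=100):
    counters = [Counter() for _ in range(slot_dim)]
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        sparse = fields[1 + dense_dim:1 + dense_dim + slot_dim]
        for i, value in enumerate(sparse):
            counters[i][value] += 1
    vocabs, next_id = [], 1                # id 0 is reserved for OOV
    for counter in counters:
        mapping = {}
        for value, freq in counter.items():
            if freq >= threshold:
                mapping[value] = next_id   # frequent values get their own id
                next_id += 1
        vocabs.append(mapping)             # infrequent values fall back to 0
    return vocabs
```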

Dataset Preparation

Process the Real World Data

  1. Download the dataset and place the raw data under a certain path, such as ./data/origin_data:
mkdir -p data/origin_data && cd data/origin_data
wget DATA_LINK
tar -zxvf dac.tar.gz

Please refer to [1] to obtain the download link.

  2. Use this script to preprocess the data:
python src/preprocess_data.py  --data_path=./data/ --dense_dim=13 --slot_dim=26 --threshold=100 --train_line_count=45840617 --skip_id_convert=0

Generate and Process the Synthetic Data

  1. The following command will generate 40 million lines of click data in the format of

"label\tdense_feature[0]\tdense_feature[1]...\tsparse_feature[0]\tsparse_feature[1]...".

mkdir -p syn_data/origin_data
python src/generate_synthetic_data.py --output_file=syn_data/origin_data/train.txt --number_examples=40000000 --dense_dim=13 --slot_dim=51 --vocabulary_size=2000000000 --random_slot_values=0
  2. Preprocess the generated data
python src/preprocess_data.py --data_path=./syn_data/  --dense_dim=13 --slot_dim=51 --threshold=0 --train_line_count=40000000 --skip_id_convert=1
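For reference, one line of the generated file can be parsed as sketched below (matching the format string above, with dense_dim=13 and slot_dim=51 as in the commands):

```python
# Hedged sketch: parse one "label\tdense...\tsparse..." line of the
# synthetic data into its label, dense, and sparse parts.
def parse_line(line, dense_dim=13, slot_dim=51):
    fields = line.rstrip('\n').split('\t')
    label = int(fields[0])
    dense = [float(x) for x in fields[1:1 + dense_dim]]
    sparse = fields[1 + dense_dim:1 + dense_dim + slot_dim]
    return label, dense, sparse
```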

Training Process

SingleDevice

To train and evaluate the model, run the following command:

python train_and_eval.py --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend

SingleDevice For Cache Mode

To train and evaluate the model, run the following command:

python train_and_eval.py --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend --sparse=True --vocab_size=200000 --vocab_cache_size=160000

Distribute Training

To train the model with data-parallel distributed training, run the following command:

# configure environment path before training
bash run_multinpu_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE

To train the model with model-parallel training, run the following command:

# configure environment path before training
bash run_auto_parallel_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE

To train the model on clusters, run the following commands:

# deploy wide&deep script in clusters
# CLUSTER_CONFIG is a json file, the sample is in script/.
# EXECUTE_PATH is the scripts path after the deploy.
bash deploy_cluster.sh CLUSTER_CONFIG_PATH EXECUTE_PATH

# enter EXECUTE_PATH, and execute start_cluster.sh as follows.
# MODE: "host_device_mix"
bash start_cluster.sh CLUSTER_CONFIG_PATH EPOCH_SIZE VOCAB_SIZE EMB_DIM
                      DATASET ENV_SH RANK_TABLE_FILE MODE

Parameter Server

To train and evaluate the model in parameter server mode, run the following command:

# SERVER_NUM is the number of parameter servers for this task.
# SCHED_HOST is the IP address of scheduler.
# SCHED_PORT is the port of scheduler.
# The number of workers is the same as RANK_SIZE.
bash run_parameter_server_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE SERVER_NUM SCHED_HOST SCHED_PORT

Parameter Server training does not support PyNative mode.

Evaluation Process

To evaluate the model, run the following command:

python eval.py --data_path=./data/mindrecord --dataset_type=mindrecord --device_target=Ascend --ckpt_path=./ckpt/widedeep_train-15_2582.ckpt
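Evaluation reports AUC over the model's sigmoid outputs. A hedged standalone sketch of the computation is below; the repo's src/metrics.py implements the metric actually used during training and evaluation:

```python
# Hedged sketch of AUC over accumulated eval batches; the dummy arrays
# stand in for real model outputs and click labels.
import numpy as np
from sklearn.metrics import roc_auc_score

preds = [np.array([0.9, 0.2, 0.7])]    # per-batch sigmoid outputs
labels = [np.array([1, 0, 1])]         # per-batch ground-truth clicks
auc = roc_auc_score(np.concatenate(labels), np.concatenate(preds))
print('auc :', auc)
```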

Inference Process

Before inference, please refer to the MindSpore Inference with C++ Deployment Guide to set the environment variables.

Export MindIR

python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --device_target [DEVICE_TARGET] --file_format [FILE_FORMAT]

The ckpt_file parameter is required, and FILE_FORMAT must be one of ["AIR", "MINDIR"].
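Internally, export follows the standard MindSpore flow. A minimal hedged sketch, with a trivial stand-in network instead of the repo's Wide&Deep net:

```python
# Minimal sketch of the MindSpore export flow; nn.Dense stands in for the
# real Wide&Deep eval network that export.py builds.
import numpy as np
import mindspore as ms
import mindspore.nn as nn

net = nn.Dense(39, 1)                        # stand-in network
# Load trained weights first, e.g.:
# param_dict = ms.load_checkpoint('./ckpt/widedeep_train-15_2582.ckpt')
# ms.load_param_into_net(net, param_dict)
dummy_input = ms.Tensor(np.zeros((16000, 39), np.float32))
ms.export(net, dummy_input, file_name='wide_and_deep', file_format='MINDIR')
```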

Infer

Before performing inference, the MindIR file must be exported by the export.py script. We only provide an example of inference using the MINDIR model.

bash run_infer_cpp.sh [MINDIR_PATH] [DATASET_PATH] [DATA_TYPE] [NEED_PREPROCESS] [DEVICE_TYPE] [DEVICE_ID]
  • DATA_TYPE means the dataset type; its value is one of ['tfrecord', 'mindrecord', 'hd5'].
  • NEED_PREPROCESS indicates whether preprocessing is needed; its value is 'y' or 'n'.
  • DEVICE_ID is optional; the default value is 0.

Result

The inference result is saved in the current path; you can find results like the following in the acc.log file.

================================================================================ auc : 0.8080494136248402

Model Description

Performance

Training Performance

| Parameters               | Single Ascend               | Single GPU                  | Data-Parallel-8P            | Host-Device-mode-8P         |
| ------------------------ | --------------------------- | --------------------------- | --------------------------- | --------------------------- |
| Resource                 | Ascend 910; OS Euler2.8     | Tesla V100-PCIE 32G         | Ascend 910; OS Euler2.8     | Ascend 910; OS Euler2.8     |
| Uploaded Date            | 07/05/2021 (month/day/year) | 07/05/2021 (month/day/year) | 07/05/2021 (month/day/year) | 07/05/2021 (month/day/year) |
| MindSpore Version        | 1.3.0                       | 1.3.0                       | 1.3.0                       | 1.3.0                       |
| Dataset                  | [1]                         | [1]                         | [1]                         | [1]                         |
| Training Parameters      | Epoch=15, batch_size=16000  | Epoch=15, batch_size=16000  | Epoch=15, batch_size=16000  | Epoch=15, batch_size=16000  |
| Optimizer                | FTRL, Adam                  | FTRL, Adam                  | FTRL, Adam                  | FTRL, Adam                  |
| Loss Function            | SigmoidCrossEntropy         | SigmoidCrossEntropy         | SigmoidCrossEntropy         | SigmoidCrossEntropy         |
| AUC Score                | 0.80937                     | 0.80971                     | 0.80862                     | 0.80834                     |
| Speed                    | 20.906 ms/step              | 24.465 ms/step              | 27.388 ms/step              | 236.506 ms/step             |
| Loss                     | wide: 0.433, deep: 0.444    | wide: 0.444, deep: 0.456    | wide: 0.437, deep: 0.448    | wide: 0.444, deep: 0.444    |
| Params (M)               | 75.84                       | 75.84                       | 75.84                       | 75.84                       |
| Checkpoint for inference | 233MB (.ckpt file)          | 230MB (.ckpt file)          | 233MB (.ckpt file)          | 233MB (.ckpt file)          |

All executable scripts can be found here.

Note: The GPU results were tested under the master version. The parameter server mode of the Wide&Deep model is still under development.

Evaluation Performance

| Parameters        | Wide&Deep                   |
| ----------------- | --------------------------- |
| Resource          | Ascend 910; OS Euler2.8     |
| Uploaded Date     | 07/05/2021 (month/day/year) |
| MindSpore Version | 1.3.0                       |
| Dataset           | [1]                         |
| Batch Size        | 16000                       |
| Outputs           | AUC                         |
| Accuracy          | AUC=0.809                   |

Ultimate performance experience

Since v1.1.1, MindSpore supports the NUMA binding feature for better performance. The numa library needs to be installed:

  • ubuntu : sudo apt-get install libnuma-dev
  • centos/euleros : sudo yum install numactl-devel

v1.1.1 supports enabling the NUMA binding feature through the config interface:

import mindspore.dataset as de
de.config.set_numa_enable(True)

v1.2.0 additionally supports enabling the NUMA binding feature through an environment variable:

export DATASET_ENABLE_NUMA=True

Description of Random Situation

There are three random situations:

  • Shuffle of the dataset.
  • Initialization of some model weights.
  • Dropout operations.
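For reproducible runs, the relevant seeds can be fixed explicitly; a minimal example:

```python
# ms.set_seed covers weight initialization and random ops; dataset
# shuffling is seeded separately through the dataset config.
import mindspore as ms
import mindspore.dataset as de

ms.set_seed(1)
de.config.set_seed(1)
```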

ModelZoo Homepage

Please check the official homepage.
