
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Train, hyperparameter tune, and deploy with PyTorch

In this tutorial, you will train, hyperparameter tune, and deploy a PyTorch model using the Azure Machine Learning (Azure ML) Python SDK.

This tutorial trains an image classification model using transfer learning, based on PyTorch's Transfer Learning tutorial. The model learns to classify chickens and turkeys, starting from a ResNet18 model pretrained on the ImageNet dataset.

Prerequisites

  • If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the Configuration notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace.
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

Diagnostics

Opt-in diagnostics for better experience, quality, and security of future releases.

from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

Initialize workspace

Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Create or Attach existing AmlCompute

You will need to create a compute target for training your model. In this tutorial, we use Azure ML managed compute (AmlCompute) for our remote training compute resource.

Creation of AmlCompute takes approximately 5 minutes. If an AmlCompute cluster with that name already exists in your workspace, this code will skip the creation process.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.
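
For example, a minimal sketch of a CPU-cluster variant of the provisioning code above (the cluster name here is illustrative):

from azureml.core.compute import AmlCompute, ComputeTarget

# Same pattern as above, but with a CPU VM size; 'cpu-cluster' is an example name
cpu_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                   max_nodes=4)
cpu_target = ComputeTarget.create(ws, 'cpu-cluster', cpu_config)
cpu_target.wait_for_completion(show_output=True)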

Train model on the remote compute

Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time.

Create a project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

import os

project_folder = './pytorch-birds'
os.makedirs(project_folder, exist_ok=True)

Download training data

The dataset we will use (located on a public blob here as a zip file) consists of about 120 training images each for turkeys and chickens, with 100 validation images for each class. The images are a subset of the Open Images v5 Dataset. We will download and extract the dataset as part of our training script, pytorch_train.py.
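
For reference, the download-and-extract step inside the training script typically looks something like the sketch below; the URL and file name are placeholders, not the actual blob location linked above.

import os
import urllib.request
import zipfile

# Placeholder URL/file name -- substitute the public blob link referenced above
data_url = 'https://<public-blob-account>/fowl_data.zip'
data_file = 'fowl_data.zip'

if not os.path.exists(data_file):
    urllib.request.urlretrieve(data_url, filename=data_file)
with zipfile.ZipFile(data_file, 'r') as zf:
    zf.extractall()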

Prepare training script

Now you will need to create your training script. In this tutorial, the training script is already provided for you at pytorch_train.py. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

However, if you would like to use Azure ML's tracking and metrics capabilities, you will have to add a small amount of Azure ML code inside your training script.

In pytorch_train.py, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML Run object within the script:

from azureml.core.run import Run
run = Run.get_context()

Further within pytorch_train.py, we log the learning rate and momentum parameters, and the best validation accuracy the model achieves:

run.log('lr', np.float(learning_rate))
run.log('momentum', np.float(momentum))

run.log('best_val_acc', np.float(best_acc))

These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.

Once your script is ready, copy the training script pytorch_train.py into your project directory.

import shutil

shutil.copy('pytorch_train.py', project_folder)

Create an experiment

Create an Experiment to track all the runs in your workspace for this transfer learning PyTorch tutorial.

from azureml.core import Experiment

experiment_name = 'pytorch-birds'
experiment = Experiment(ws, name=experiment_name)

Create an environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.

%%writefile conda_dependencies.yml

channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - torch==1.6.0
  - torchvision==0.7.0
  - future==0.17.1
  - pillow

from azureml.core import Environment

pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-gpu', file_path = './conda_dependencies.yml')

# Specify a GPU base image
pytorch_env.docker.enabled = True
pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

Configure the training job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. The following code will configure a single-node PyTorch job.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='pytorch_train.py',
                      arguments=['--num_epochs', 30, '--output_dir', './outputs'],
                      compute_target=compute_target,
                      environment=pytorch_env)

Submit job

Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.

run = experiment.submit(src)
print(run)
# to get more details of your run
print(run.get_details())

Monitor your run

You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

from azureml.widgets import RunDetails

RunDetails(run).show()

Alternatively, you can block until the script has completed training before running more code.

run.wait_for_completion(show_output=True)

Tune model hyperparameters

Now that we've seen how to do a simple PyTorch training run using the SDK, let's see if we can further improve the accuracy of our model. We can optimize our model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities.

Start a hyperparameter sweep

First, we will define the hyperparameter space to sweep over. Since our training script uses a learning rate schedule to decay the learning rate every several epochs, let's tune the initial learning rate and the momentum parameters. In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (best_val_acc).

Then, we specify the early termination policy to use to early terminate poorly performing runs. Here we use the BanditPolicy, which will terminate any run that doesn't fall within the slack factor of our primary evaluation metric. In this tutorial, we will apply this policy every epoch (since we report our best_val_acc metric every epoch and evaluation_interval=1). Notice we will delay the first policy evaluation until after the first 10 epochs (delay_evaluation=10). Refer here for more information on the BanditPolicy and other policies available.

from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, PrimaryMetricGoal

param_sampling = RandomParameterSampling( {
        'learning_rate': uniform(0.0005, 0.005),
        'momentum': uniform(0.9, 0.99)
    }
)

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name='best_val_acc',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)

Finally, launch the hyperparameter tuning job.

# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)

Monitor HyperDrive runs

You can monitor the progress of the runs with the following Jupyter widget.

RunDetails(hyperdrive_run).show()

Or block until the HyperDrive sweep has completed:

hyperdrive_run.wait_for_completion(show_output=True)
assert(hyperdrive_run.get_status() == "Completed")

Warm start a hyperparameter tuning experiment and resume child runs

Finding the best hyperparameter values for your model is often an iterative process that requires multiple tuning runs, each building on previous ones. Reusing knowledge from these previous runs accelerates the hyperparameter tuning process, reducing the cost of tuning the model and potentially improving the primary metric of the resulting model. When warm starting a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run are used as prior knowledge to intelligently pick new samples, so as to improve the primary metric. Additionally, when using Random or Grid sampling, any early termination decisions will leverage metrics from the previous runs to identify poorly performing training runs.

Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed hyperparameter tuning parent runs.
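
As a sketch, assuming your SDK version supports the resume_from parameter of HyperDriveConfig, a warm-started sweep reusing the run above might look like this:

# Assumes `hyperdrive_run` from the sweep above has completed; pass it (and up
# to 4 more completed parent runs) as prior knowledge via resume_from
warm_start_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     resume_from=[hyperdrive_run],
                                     primary_metric_name='best_val_acc',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)
warm_start_run = experiment.submit(warm_start_config)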

Additionally, there might be occasions when individual training runs of a hyperparameter tuning experiment are cancelled due to budget constraints or fail for other reasons. It is now possible to resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run will use the same hyperparameter configuration and mount the storage used for that run. The training script should accept the "--resume-from" argument, which contains the checkpoint or model files from which to resume training. You can also resume individual runs as part of an experiment that spends additional budget on hyperparameter tuning. Any budget remaining after the specified training runs have been resumed is used to explore additional configurations.
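
On the training-script side, handling "--resume-from" might look like the following sketch; the checkpoint file name and dictionary keys are assumptions for illustration, and your script's checkpoint format will differ.

import argparse
import os
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--resume-from', type=str, default=None,
                    help='folder containing checkpoint/model files to resume from')
args, _ = parser.parse_known_args()

start_epoch = 0
if args.resume_from:
    # 'checkpoint.pt' and its keys are illustrative assumptions
    checkpoint = torch.load(os.path.join(args.resume_from, 'checkpoint.pt'))
    model.load_state_dict(checkpoint['model_state'])  # model built earlier in the script
    start_epoch = checkpoint['epoch'] + 1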

For more information on warm starting and resuming hyperparameter tuning runs, please refer to the Hyperparameter Tuning for Azure Machine Learning documentation.

Find and register the best model

Once all the runs complete, we can find the run that produced the model with the highest accuracy.

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('Best Run is:\n  Validation accuracy: {0:.5f} \n  Learning rate: {1:.5f} \n  Momentum: {2:.5f}'.format(
        best_run_metrics['best_val_acc'][-1],
        best_run_metrics['lr'],
        best_run_metrics['momentum'])
     )

Finally, register the model from your best-performing run to your workspace. The model_path parameter takes in the relative path on the remote VM to the model file in your outputs directory. In the next section, we will deploy this registered model as a web service.

model = best_run.register_model(model_name = 'pytorch-birds', model_path = 'outputs/model.pt')
print(model.name, model.id, model.version, sep = '\t')

Deploy model as web service

Once you have your trained model, you can deploy the model on Azure. In this tutorial, we will deploy the model as a web service in Azure Container Instances (ACI). For more information on deploying models using Azure ML, refer here.

Create scoring script

First, we will create a scoring script that will be invoked by the web service call. Note that the scoring script must have two required functions:

  • init(): In this function, you typically load the model into a global object. This function is executed only once when the Docker container is started.
  • run(input_data): In this function, the model is used to predict a value based on the input data. The input and output typically use JSON as serialization and deserialization format, but you are not limited to that.

Refer to the scoring script pytorch_score.py for this tutorial. Our web service will use this file to predict whether an image is a chicken or a turkey. When writing your own scoring script, don't forget to test it locally first before you go and deploy the web service.
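
For orientation, a minimal scoring script with this shape might look like the sketch below; pytorch_score.py in the repository is the authoritative version, and the model file name and label order here are assumptions.

import json
import os
import torch
import torch.nn.functional as F

def init():
    global model
    # AZUREML_MODEL_DIR points at the folder of the registered model
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pt')
    model = torch.load(model_path, map_location='cpu')
    model.eval()

def run(input_data):
    data = torch.tensor(json.loads(input_data)['data'])
    with torch.no_grad():
        probs = F.softmax(model(data), dim=1)
    classes = ['chicken', 'turkey']  # assumed label order
    _, index = torch.max(probs, 1)
    return {'label': classes[index.item()]}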

Define the environment

Then, we will need to create an Azure ML environment that specifies all of the scoring script's package dependencies. In this tutorial, we will reuse the same environment, pytorch_env, that we created for training.

Deploy to ACI container

We are ready to deploy. Create an inference configuration, which specifies the environment and scripts for inferencing. Create a deployment configuration file to specify the number of CPUs and gigabytes of RAM needed for your ACI container. While it depends on your model, the default of 1 core and 1 gigabyte of RAM is usually sufficient for many models. This cell will run for about 7-8 minutes.

from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice
from azureml.core.model import Model

inference_config = InferenceConfig(entry_script="pytorch_score.py", environment=pytorch_env)

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={'data': 'birds',  'method':'transfer learning', 'framework':'pytorch'},
                                               description='Classify turkey/chickens using transfer learning with PyTorch')

service = Model.deploy(workspace=ws,
                       name='aci-birds',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aciconfig)
service.wait_for_deployment(True)
print(service.state)

If your deployment fails for any reason and you need to redeploy, make sure to delete the service before you do so: service.delete()

Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:

service.get_logs()

Get the web service's HTTP endpoint, which accepts REST client calls. This endpoint can be shared with anyone who wants to test the web service or integrate it into an application.

print(service.scoring_uri)

Test the web service

Finally, let's test our deployed web service. We will send the data as a JSON string to the web service hosted in ACI and use the SDK's run API to invoke the service. Here we will take an image from our validation data to predict on.

import json
from PIL import Image
import matplotlib.pyplot as plt

%matplotlib inline
plt.imshow(Image.open('test_img.jpg'))

import torch
from torchvision import transforms

def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

    image = Image.open(image_file)
    # ToTensor() already returns a tensor, so no extra torch.tensor() copy is needed
    image = data_transforms(image).float()
    image = image.unsqueeze(0)  # add a batch dimension
    return image.numpy()

input_data = preprocess('test_img.jpg')
result = service.run(input_data=json.dumps({'data': input_data.tolist()}))
print(result)
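
The same prediction can be requested without the SDK by posting the JSON payload directly to the scoring URI; a hedged sketch using requests:

import json
import requests

headers = {'Content-Type': 'application/json'}
resp = requests.post(service.scoring_uri,
                     data=json.dumps({'data': input_data.tolist()}),
                     headers=headers)
print(resp.json())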

Clean up

Once you no longer need the web service, you can delete it with a simple API call.

service.delete()