Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
In this tutorial, you will train, hyperparameter tune, and deploy a PyTorch model using the Azure Machine Learning (Azure ML) Python SDK.
This tutorial will train an image classification model using transfer learning, based on PyTorch's Transfer Learning tutorial. The model is trained to classify chickens and turkeys by first using a pretrained ResNet18 model that has been trained on the ImageNet dataset.
Workspace
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)
Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.
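For reference, config.json is a small JSON file that holds three identifiers for the workspace. The sketch below writes and reads back such a file using only the standard library. The values are placeholders rather than real identifiers, and a temporary directory is used (an assumption made for this example) so that no real configuration is overwritten.

```python
import json
import os
import tempfile

# Placeholder values for illustration only; substitute your own identifiers.
example_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

# Write the file into a temporary directory so nothing real is touched.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(example_config, f, indent=2)

# Read it back to confirm the three expected keys are present.
with open(config_path) as f:
    loaded_config = json.load(f)

print(sorted(loaded_config))
```

In a real project the file would normally sit next to the notebook or in a `.azureml` subdirectory, where Workspace.from_config() can find it.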
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
'Azure region: ' + ws.location,
'Subscription id: ' + ws.subscription_id,
'Resource group: ' + ws.resource_group, sep='\n')
You will need to create a compute target for training your model. In this tutorial, we use Azure ML managed compute (AmlCompute) for our remote training compute resource.
Creation of AmlCompute takes approximately 5 minutes. If the AmlCompute with that name is already in your workspace, this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# choose a name for your cluster
cluster_name = "gpu-cluster"
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)
# use get_status() to get a detailed status for the current cluster.
print(compute_target.get_status().serialize())
The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the vm_size parameter, such as STANDARD_D2_V2.
Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time.
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.
import os
project_folder = './pytorch-birds'
os.makedirs(project_folder, exist_ok=True)
The dataset we will use (located on a public blob here as a zip file) consists of about 120 training images each for turkeys and chickens, with 100 validation images for each class. The images are a subset of the Open Images v5 Dataset. We will download and extract the dataset as part of our training script, pytorch_train.py.
Now you will need to create your training script. In this tutorial, the training script is already provided for you at pytorch_train.py. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.
However, if you would like to use Azure ML's tracking and metrics capabilities, you will have to add a small amount of Azure ML code inside your training script.
In pytorch_train.py, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML Run object within the script:
from azureml.core.run import Run
run = Run.get_context()
Further within pytorch_train.py, we log the learning rate and momentum parameters, and the best validation accuracy the model achieves:
run.log('lr', np.float(learning_rate))
run.log('momentum', np.float(momentum))
run.log('best_val_acc', np.float(best_acc))
These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.
Once your script is ready, copy the training script pytorch_train.py into your project directory.
import shutil
shutil.copy('pytorch_train.py', project_folder)
Create an Experiment to track all the runs in your workspace for this transfer learning PyTorch tutorial.
from azureml.core import Experiment
experiment_name = 'pytorch-birds'
experiment = Experiment(ws, name=experiment_name)
Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.
%%writefile conda_dependencies.yml
channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - torch==1.6.0
  - torchvision==0.7.0
  - future==0.17.1
  - pillow
from azureml.core import Environment
pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-gpu', file_path = './conda_dependencies.yml')
# Specify a GPU base image
pytorch_env.docker.enabled = True
pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. The following code will configure a single-node PyTorch job.
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory=project_folder,
script='pytorch_train.py',
arguments=['--num_epochs', 30, '--output_dir', './outputs'],
compute_target=compute_target,
environment=pytorch_env)
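The arguments list above is handed to the training script on the command line. The tutorial's pytorch_train.py is not reproduced here, so the following is only a minimal sketch of how such a script might parse these arguments. The function name and default values are assumptions; the `--learning_rate` and `--momentum` names are chosen to match the hyperparameter names used in the tuning section.

```python
import argparse


def parse_training_args(argv=None):
    """Parse command-line arguments for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Transfer learning (sketch)")
    parser.add_argument("--num_epochs", type=int, default=25)
    parser.add_argument("--output_dir", type=str, default="./outputs")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    return parser.parse_args(argv)


# Values arrive as strings on the command line, so the integer 30 in the
# arguments list reaches the script as the string "30" and is converted here.
args = parse_training_args(["--num_epochs", "30", "--output_dir", "./outputs"])
print(args.num_epochs, args.output_dir, args.learning_rate, args.momentum)
```

Passing an explicit list to the parser, as done here, keeps the function easy to test outside a remote run.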
Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.
run = experiment.submit(src)
print(run)
# to get more details of your run
print(run.get_details())
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.
from azureml.widgets import RunDetails
RunDetails(run).show()
Alternatively, you can block until the script has completed training before running more code.
run.wait_for_completion(show_output=True)
Now that we've seen how to do a simple PyTorch training run using the SDK, let's see if we can further improve the accuracy of our model. We can optimize our model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities.
First, we will define the hyperparameter space to sweep over. Since our training script uses a learning rate schedule to decay the learning rate every several epochs, let's tune the initial learning rate and the momentum parameters. In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, the best validation accuracy (best_val_acc).
Then, we specify the early termination policy to use to terminate poorly performing runs early. Here we use the BanditPolicy, which will terminate any run that doesn't fall within the slack factor of our primary evaluation metric. In this tutorial, we will apply this policy every epoch (since we report our best_val_acc metric every epoch and evaluation_interval=1). Notice we will delay the first policy evaluation until after the first 10 epochs (delay_evaluation=10).
Refer here for more information on the BanditPolicy and other policies available.
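The slack-factor rule can be made concrete with a small sketch. It assumes the behavior described in the Azure ML documentation for a maximization goal: a run is terminated when its metric falls below best_metric / (1 + slack_factor), and the check applies only at intervals that are multiples of evaluation_interval and at least delay_evaluation. The function name is hypothetical, and this is an illustration rather than HyperDrive's implementation.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor,
                            interval, evaluation_interval=1,
                            delay_evaluation=0):
    """Return True if a run would be cut under a slack-factor rule.

    Assumes the primary metric is being maximized.
    """
    # The policy is not applied before the delay has passed,
    # nor between evaluation intervals.
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return False
    # Runs below this threshold are outside the allowed slack.
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# With the best run at 0.95 and slack_factor=0.15, the threshold is about 0.826.
print(bandit_should_terminate(0.80, 0.95, 0.15, interval=12, delay_evaluation=10))
print(bandit_should_terminate(0.90, 0.95, 0.15, interval=12, delay_evaluation=10))
print(bandit_should_terminate(0.10, 0.95, 0.15, interval=5, delay_evaluation=10))
```

The third call shows the effect of the delay: even a very poor run is left alone at epoch 5 because the policy has not started evaluating yet.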
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, PrimaryMetricGoal
param_sampling = RandomParameterSampling( {
'learning_rate': uniform(0.0005, 0.005),
'momentum': uniform(0.9, 0.99)
}
)
early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)
hyperdrive_config = HyperDriveConfig(run_config=src,
hyperparameter_sampling=param_sampling,
policy=early_termination_policy,
primary_metric_name='best_val_acc',
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
max_total_runs=8,
max_concurrent_runs=4)
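To give a feel for what random sampling over this space produces, here is a standard-library sketch that draws eight configurations from the same two uniform ranges. It only mimics the idea; the helper name and the fixed seed are assumptions for illustration, and HyperDrive's own sampler is not shown here.

```python
import random


def sample_configurations(space, n_runs, seed=0):
    """Draw n_runs configurations, each value uniform within its (low, high) range."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        for _ in range(n_runs)
    ]


# Same ranges as the sweep above; 8 draws to mirror max_total_runs=8.
search_space = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}
configs = sample_configurations(search_space, n_runs=8)
for config in configs:
    print(config)
```

Each dictionary corresponds to one child run, whose values would be passed to the training script as command-line arguments.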
Finally, launch the hyperparameter tuning job.
# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)
RunDetails(hyperdrive_run).show()
Or block until the HyperDrive sweep has completed:
hyperdrive_run.wait_for_completion(show_output=True)
assert(hyperdrive_run.get_status() == "Completed")
Finding the best hyperparameter values for your model is often an iterative process that needs multiple tuning runs, each learning from the previous ones. Reusing knowledge from these previous runs accelerates the hyperparameter tuning process, which reduces the cost of tuning the model and can improve the primary metric of the resulting model. When warm starting a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run are used as prior knowledge to intelligently pick new samples, so as to improve the primary metric. Additionally, when using Random or Grid sampling, any early termination decisions leverage metrics from the previous runs to identify poorly performing training runs.
Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed hyperparameter tuning parent runs.
Additionally, there might be occasions when individual training runs of a hyperparameter tuning experiment are cancelled due to budget constraints or fail for other reasons. It is possible to resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run will use the same hyperparameter configuration and mount the storage used for that run. The training script should accept the "--resume-from" argument, which contains the checkpoint or model files from which to resume the training run. You can also resume individual runs as part of an experiment that spends additional budget on hyperparameter tuning. Any budget that remains after resuming the specified training runs is used for exploring additional configurations.
For more information on warm starting and resuming hyperparameter tuning runs, please refer to the Hyperparameter Tuning for Azure Machine Learning documentation.
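As a rough illustration of the resume mechanism on the script side, the sketch below accepts a "--resume-from" argument and picks up from a saved epoch. Everything beyond the argument name is an assumption: the checkpoint.json file name, its "epoch" field, and the helper names are hypothetical, and a real PyTorch script would more likely save and load model and optimizer state rather than a JSON file.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # argparse exposes "--resume-from" as args.resume_from.
    parser.add_argument("--resume-from", type=str, default=None,
                        help="directory holding files from the earlier run")
    return parser.parse_args(argv)


def load_start_epoch(resume_dir, filename="checkpoint.json"):
    """Return the epoch to start from, or 0 when there is nothing to resume."""
    if resume_dir is None:
        return 0
    path = os.path.join(resume_dir, filename)
    if not os.path.isfile(path):
        return 0
    with open(path) as f:
        return json.load(f)["epoch"] + 1


# Simulate an earlier run that saved its state after epoch 11 (0-based).
previous_run_dir = tempfile.mkdtemp()
with open(os.path.join(previous_run_dir, "checkpoint.json"), "w") as f:
    json.dump({"epoch": 11}, f)

args = parse_args(["--resume-from", previous_run_dir])
start_epoch = load_start_epoch(args.resume_from)
print(start_epoch)
```

Falling back to epoch 0 when no checkpoint is found lets the same script serve both fresh and resumed runs.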
Once all the runs complete, we can find the run that produced the model with the highest accuracy.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('Best Run is:\n Validation accuracy: {0:.5f} \n Learning rate: {1:.5f} \n Momentum: {2:.5f}'.format(
best_run_metrics['best_val_acc'][-1],
best_run_metrics['lr'],
best_run_metrics['momentum'])
)
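The indexing with [-1] above reflects that a metric logged several times during a run comes back as a list, while a metric logged once comes back as a single value. The following sketch shows the selection idea on made-up metrics for three runs. The run names and numbers are invented for illustration, and this is not how HyperDrive itself ranks runs.

```python
# Invented metrics for three hypothetical child runs.
runs = {
    "run_a": {"best_val_acc": [0.81, 0.88, 0.90], "lr": 0.0012, "momentum": 0.93},
    "run_b": {"best_val_acc": [0.79, 0.91, 0.94], "lr": 0.0031, "momentum": 0.91},
    "run_c": {"best_val_acc": 0.85, "lr": 0.0045, "momentum": 0.98},
}


def final_value(metric):
    """Return the last logged value, whether the metric is a list or a scalar."""
    return metric[-1] if isinstance(metric, list) else metric


# Pick the run whose last reported validation accuracy is highest.
best_name = max(runs, key=lambda name: final_value(runs[name]["best_val_acc"]))
best_metrics = runs[best_name]
print(best_name, final_value(best_metrics["best_val_acc"]))
```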
Finally, register the model from your best-performing run to your workspace. The model_path parameter takes in the relative path on the remote VM to the model file in your outputs directory. In the next section, we will deploy this registered model as a web service.
model = best_run.register_model(model_name = 'pytorch-birds', model_path = 'outputs/model.pt')
print(model.name, model.id, model.version, sep = '\t')
Once you have your trained model, you can deploy the model on Azure. In this tutorial, we will deploy the model as a web service in Azure Container Instances (ACI). For more information on deploying models using Azure ML, refer here.
First, we will create a scoring script that will be invoked by the web service call. Note that the scoring script must have two required functions:
init(): In this function, you typically load the model into a global object. This function is executed only once when the Docker container is started.
run(input_data): In this function, the model is used to predict a value based on the input data. The input and output typically use JSON as serialization and deserialization format, but you are not limited to that.
Refer to the scoring script pytorch_score.py for this tutorial. Our web service will use this file to predict whether an image is a chicken or a turkey. When writing your own scoring script, don't forget to test it locally first before you go and deploy the web service.
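The two-function contract can be sketched without any model framework. The example below is not the tutorial's pytorch_score.py; it uses a stand-in "model" (a function that returns the class with the highest score) purely to show the shape of init() and run(), and it doubles as the kind of local test the text recommends. The class order and the error format are assumptions.

```python
import json

model = None  # populated once by init()


def init():
    """Load the model into a global object (runs once at container start)."""
    global model
    class_names = ["chicken", "turkey"]  # assumed class order

    def stand_in_model(scores):
        # Stand-in for a real network: pick the class with the highest score.
        return class_names[scores.index(max(scores))]

    model = stand_in_model


def run(input_data):
    """Deserialize the JSON request, predict, and return a serializable result."""
    try:
        scores = json.loads(input_data)["data"]
        return {"label": model(scores)}
    except Exception as e:
        # Returning the error keeps a bad request from crashing the service.
        return {"error": str(e)}


# Local test: call init() once, then run() with a JSON string.
init()
print(run(json.dumps({"data": [0.2, 0.8]})))
```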
Then, we will need to create an Azure ML environment that specifies all of the scoring script's package dependencies. In this tutorial, we will reuse the same environment, pytorch_env, that we created for training.
We are ready to deploy. Create an inference configuration that specifies the inferencing environment and scripts. Create a deployment configuration to specify the number of CPUs and gigabytes of RAM needed for your ACI container. While it depends on your model, the default of 1 core and 1 gigabyte of RAM is sufficient for many models. This cell will run for about 7-8 minutes.
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice
from azureml.core.model import Model
inference_config = InferenceConfig(entry_script="pytorch_score.py", environment=pytorch_env)
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,
memory_gb=1,
tags={'data': 'birds', 'method':'transfer learning', 'framework':'pytorch'},
description='Classify turkey/chickens using transfer learning with PyTorch')
service = Model.deploy(workspace=ws,
name='aci-birds',
models=[model],
inference_config=inference_config,
deployment_config=aciconfig)
service.wait_for_deployment(True)
print(service.state)
If your deployment fails for any reason and you need to redeploy, make sure to delete the service before you do so: service.delete()
Tip: If something goes wrong with the deployment, the first thing to look at is the logs from the service by running the following command:
service.get_logs()
Get the web service's HTTP endpoint, which accepts REST client calls. This endpoint can be shared with anyone who wants to test the web service or integrate it into an application.
print(service.scoring_uri)
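Because the endpoint accepts REST calls, any HTTP client can be used in place of the SDK. The sketch below builds, but does not send, a POST request with the standard library. The URL is a placeholder standing in for the value of service.scoring_uri, the helper name is hypothetical, and the {"data": ...} payload shape is assumed to match what the scoring script expects.

```python
import json
import urllib.request

# Placeholder endpoint; in practice use the value of service.scoring_uri.
scoring_uri = "http://example.invalid/score"


def build_scoring_request(uri, data):
    """Build a JSON POST request for a scoring endpoint without sending it."""
    body = json.dumps({"data": data}).encode("utf-8")
    return urllib.request.Request(
        uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scoring_request(scoring_uri, [[0.0, 1.0]])
print(request.get_method(), request.full_url)
# To actually send it against a live service: urllib.request.urlopen(request)
```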
Finally, let's test our deployed web service. We will send the data as a JSON string to the web service hosted in ACI and use the SDK's run API to invoke the service. Here we will take an image from our validation data to predict on.
import json
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(Image.open('test_img.jpg'))
import torch
from torchvision import transforms
def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    image = Image.open(image_file)
    image = data_transforms(image).float()
    image = torch.tensor(image)
    image = image.unsqueeze(0)
    return image.numpy()
input_data = preprocess('test_img.jpg')
result = service.run(input_data=json.dumps({'data': input_data.tolist()}))
print(result)
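Two details of the request above are easy to check by hand: the normalization step computes (value - mean) / std per channel, and the payload is a nested list with a leading batch dimension. The sketch below reproduces both in plain Python on a tiny 2x2 "image" instead of the real 224x224 input; the helper names are hypothetical.

```python
import json

# Same per-channel statistics as in the preprocessing step.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


def make_batch(pixel, height, width):
    """Build a nested list shaped [1, 3, height, width] filled with one pixel."""
    channels = normalize_pixel(pixel)
    return [[[[c] * width for _ in range(height)] for c in channels]]


# A pure white 2x2 image, normalized and wrapped in a batch of one.
batch = make_batch([1.0, 1.0, 1.0], height=2, width=2)
payload = json.dumps({"data": batch})
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

For the real input the same nesting holds with height and width of 224, so the JSON string carries 1 x 3 x 224 x 224 numbers.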
service.delete()