
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Train using Azure Machine Learning Compute

  • Initialize a Workspace
  • Create an Experiment
  • Introduction to AmlCompute
  • Submit an AmlCompute run in a few different ways
    • Provision as a persistent compute target (Basic)
    • Provision as a persistent compute target (Advanced)
  • Additional operations to perform on AmlCompute
  • Find the best model in the run

Prerequisites

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration Notebook first if you haven't already to establish your connection to the AzureML Workspace.

# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

Initialize a Workspace

Initialize a workspace object from persisted configuration

from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

Create an Experiment

Experiment is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments.

from azureml.core import Experiment
experiment_name = 'train-on-amlcompute'
experiment = Experiment(workspace = ws, name = experiment_name)

Introduction to AmlCompute

Azure Machine Learning Compute is managed compute infrastructure that lets you easily create single- or multi-node compute of the appropriate VM family. It is created within your workspace region and can be shared with other users of your workspace. When a job is submitted, it autoscales by default up to max_nodes and runs the job in a containerized environment that packages the dependencies you specify.

Since it is managed compute, job scheduling and cluster management are handled internally by Azure Machine Learning service.

For more information on Azure Machine Learning Compute, please read this article

If you are an existing BatchAI customer who is migrating to Azure Machine Learning, please read this article

Note: As with other Azure services, there are limits on certain resources (e.g. AmlCompute quota) associated with the Azure Machine Learning service. Please read this article on the default limits and how to request more quota.
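
As a quick check (not part of the original notebook), you can list the compute targets that already exist in your workspace. The sketch below assumes the ws object created earlier and uses the Workspace.compute_targets property from azureml.core.

# List the compute targets already attached to this workspace (illustrative cell).
for name, target in ws.compute_targets.items():
    print(name, type(target).__name__)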

The training script train.py is already created for you. Let's have a look.
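
A minimal sketch to print it from this notebook (assuming train.py sits next to the notebook, as the copy step below also assumes):

# Show the training script that will be executed on the remote compute.
with open('train.py', 'r') as f:
    print(f.read())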

Submit an AmlCompute run in a few different ways

First, let's check which VM families are available in your region. Azure is a regional service and some specialized SKUs (especially GPUs) are only available in certain regions. Since AmlCompute is created in the region of your workspace, we will use the supported_vmsizes() function to check whether the VM family we want to use ('STANDARD_D2_V2') is supported.

You can also pass a different region to check availability and then re-create your workspace in that region through the configuration notebook.

from azureml.core.compute import ComputeTarget, AmlCompute

AmlCompute.supported_vmsizes(workspace = ws)
#AmlCompute.supported_vmsizes(workspace = ws, location='southcentralus')

Create project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

import os
import shutil

project_folder = './train-on-amlcompute'
os.makedirs(project_folder, exist_ok=True)
shutil.copy('train.py', project_folder)

Create environment

Create a Docker-based environment with scikit-learn installed.

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

myenv = Environment("myenv")

myenv.docker.enabled = True
myenv.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn', 'packaging'])

Provision as a persistent compute target (Basic)

You can provision a persistent AmlCompute resource by defining just two parameters, thanks to smart defaults. By default it autoscales from 0 nodes and provisions dedicated VMs to run your job in a container. This is useful when you want to continuously re-use the same target, debug it between jobs, or simply share the resource with other users of your workspace.

  • vm_size: VM family of the nodes provisioned by AmlCompute. Simply choose from the supported_vmsizes() above
  • max_nodes: Maximum nodes to autoscale to while running a job on AmlCompute

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Configure & Run

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder, 
                      script='train.py', 
                      compute_target=cpu_cluster, 
                      environment=myenv)
 
run = experiment.submit(config=src)
run

Note: if you need to cancel a run, you can follow these instructions.
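
For reference, a sketch of cancelling from the SDK, kept commented out so the cell below can still wait for completion (run.cancel() and run.get_status() are methods on the azureml.core Run object):

# Uncomment to cancel the submitted run while it is still queued or running.
# if run.get_status() not in ['Completed', 'Failed', 'Canceled']:
#     run.cancel()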

%%time
# Shows output of the run on stdout.
run.wait_for_completion(show_output=True)
run.get_metrics()

Provision as a persistent compute target (Advanced)

You can also specify additional properties or change defaults while provisioning AmlCompute using a more advanced configuration. This is useful when you want a dedicated cluster of 4 nodes (for example you can set the min_nodes and max_nodes to 4), or want the compute to be within an existing VNet in your subscription.

In addition to vm_size and max_nodes, you can specify:

  • min_nodes: Minimum nodes (default 0 nodes) to downscale to while running a job on AmlCompute
  • vm_priority: Choose between 'dedicated' (default) and 'lowpriority' VMs when provisioning AmlCompute. Low Priority VMs use Azure's excess capacity and are thus cheaper but risk your run being pre-empted
  • idle_seconds_before_scaledown: Idle time (default 120 seconds) to wait after run completion before auto-scaling to min_nodes
  • vnet_resourcegroup_name: Resource group of the existing VNet within which AmlCompute should be provisioned
  • vnet_name: Name of VNet
  • subnet_name: Name of SubNet within the VNet
  • admin_username: Name of the admin user account that will be created on all nodes of the cluster
  • admin_user_password: Password that you want to set for the user account above
  • admin_user_ssh_key: SSH key for the user account above. You can specify either a password or an SSH key, or both
  • remote_login_port_public_access: Flag to enable or disable the public SSH port. If you don't specify it, AmlCompute will smartly close the port when deploying inside a VNet
  • identity_type: Compute Identity type that you want to set on the cluster, which can either be SystemAssigned or UserAssigned
  • identity_id: Resource ID of the identity when identity_type is UserAssigned; optional otherwise

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           vm_priority='lowpriority',
                                                           min_nodes=2,
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=300,
                                                           vnet_resourcegroup_name='<my-resource-group>',
                                                           vnet_name='<my-vnet-name>',
                                                           subnet_name='<my-subnet-name>',
                                                           admin_username='<my-username>',
                                                           admin_user_password='<my-password>',
                                                           admin_user_ssh_key='<my-sshkey>',
                                                           remote_login_port_public_access='enabled',
                                                           identity_type='UserAssigned',
                                                           identity_id=['<resource-id1>'])
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Configure & Run

# Set compute target to the one created in previous step
src.run_config.target = cpu_cluster
 
run = experiment.submit(config=src)
run

%%time
# Shows output of the run on stdout.
run.wait_for_completion(show_output=True)
run.get_metrics()

Additional operations to perform on AmlCompute

You can perform more operations on AmlCompute such as updating the node counts or deleting the compute.

# get_status() gets the latest status of the AmlCompute target
cpu_cluster.get_status().serialize()

# list_nodes() gets the list of nodes on the cluster with status, IP and associated run
cpu_cluster.list_nodes()

# update() takes in min_nodes, max_nodes and idle_seconds_before_scaledown and updates the AmlCompute target
# cpu_cluster.update(min_nodes=1)
# cpu_cluster.update(max_nodes=10)
cpu_cluster.update(idle_seconds_before_scaledown=300)
# cpu_cluster.update(min_nodes=2, max_nodes=4, idle_seconds_before_scaledown=600)

# delete() deprovisions and deletes the AmlCompute target. Useful if you want to re-use the compute name
# 'cpu-cluster' in this case but provision a different VM family, for instance.
# cpu_cluster.delete()

Success!

Great, you are ready to move on to the remaining notebooks.
