Hyperparameter tune a Keras model

This tutorial demonstrates how you can efficiently tune hyperparameters for a model using HyperDrive, Azure ML’s hyperparameter tuning functionality. You will train a Keras model on the CIFAR10 dataset, automate hyperparameter exploration, launch parallel jobs, log your results, and find the best run.

What are hyperparameters?

Hyperparameters are variable parameters chosen to train a model. Learning rate, number of epochs, and batch size are all examples of hyperparameters.

Using brute-force methods to find the optimal hyperparameter values can be time-consuming, and poorly performing runs waste money. To avoid this, HyperDrive automates hyperparameter exploration: it launches several parallel runs with different configurations and finds the configuration that gives the best performance on your primary metric.

Let’s get started with the example to see how it works!

Prerequisites

If you don’t have access to an Azure ML workspace, follow the setup tutorial to configure and create a workspace.

Set up development environment

The setup for your development work in this tutorial includes the following actions:

  • Import required packages
  • Connect to a workspace
  • Create an experiment to track your runs
  • Create a remote compute target to use for training

Import azuremlsdk package

library(azuremlsdk)

Load your workspace

Instantiate a workspace object from your existing workspace. The following code will load the workspace details from a config.json file if you previously wrote one out with write_workspace_config().

ws <- load_workspace_from_config()

Or, you can retrieve a workspace by directly specifying your workspace details:

ws <- get_workspace("<your workspace name>", "<your subscription ID>", "<your resource group>")

Create an experiment

An Azure ML experiment tracks a grouping of runs, typically from the same training script. Create an experiment to track hyperparameter tuning runs for the Keras model.

exp <- experiment(workspace = ws, name = 'hyperdrive-cifar10')

To track your runs in an existing experiment, pass that experiment’s name as the name parameter of experiment().

Create a compute target

By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn’t already exist in your workspace.

You may need to wait a few minutes for your compute cluster to be provisioned if it doesn’t already exist.

cluster_name <- "gpucluster"

compute_target <- get_compute(ws, cluster_name = cluster_name)
if (is.null(compute_target)) {
  vm_size <- "STANDARD_NC6"
  compute_target <- create_aml_compute(workspace = ws, 
                                       cluster_name = cluster_name,
                                       vm_size = vm_size, 
                                       max_nodes = 4)
  
  wait_for_provisioning_completion(compute_target, show_output = TRUE)
}

Prepare the training script

A training script called cifar10_cnn.R has been provided for you in the hyperparameter-tune-with-keras folder.

In order to leverage HyperDrive, the training script for your model must log the relevant metrics during model training. When you configure the hyperparameter tuning run, you specify the primary metric to use for evaluating run performance. You must log this metric so it is available to the hyperparameter tuning process.

In order to log the required metrics, you need to do the following inside the training script:

  • Import the azuremlsdk package:

library(azuremlsdk)

  • Take the hyperparameters as command-line arguments to the script. This is necessary so that when HyperDrive carries out the hyperparameter sweep, it can run the training script with different values for the hyperparameters, as defined by the search space.

  • Use the log_metric_to_run() function to log the hyperparameters and the primary metric:

log_metric_to_run("batch_size", batch_size)
...
log_metric_to_run("epochs", epochs)
...
log_metric_to_run("lr", lr)
...
log_metric_to_run("decay", decay)
...
log_metric_to_run("Loss", results[[1]])
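
The command-line step can be sketched in base R as follows. This is a minimal illustration, not the contents of cifar10_cnn.R: the helper name parse_hyperparameters(), the "--name value" flag format, and the default values are all assumptions made for the example, and a real script might use an argument-parsing package instead.

```r
# Hypothetical helper: parse "--name value" pairs into a named list of
# numeric hyperparameters, falling back to defaults for missing flags.
parse_hyperparameters <- function(args,
                                  defaults = list(batch_size = 32, epochs = 100,
                                                  lr = 1e-4, decay = 1e-6)) {
  params <- defaults
  i <- 1
  while (i < length(args)) {
    name <- sub("^--", "", args[i])
    if (startsWith(args[i], "--") && name %in% names(params)) {
      # Known flag: the next element holds its value.
      params[[name]] <- as.numeric(args[i + 1])
      i <- i + 2
    } else {
      # Unknown token: skip it.
      i <- i + 1
    }
  }
  params
}

# In a training script, args would come from commandArgs(trailingOnly = TRUE).
example_args <- c("--batch_size", "64", "--lr", "0.001")
params <- parse_hyperparameters(example_args)
str(params)
```

Flags that are not supplied keep their defaults, so the same script can run both standalone and under a sweep.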

Create an estimator

An Azure ML estimator encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs execute as containerized jobs on the specified compute target. The estimator defines the configuration for each of the child runs that the parent HyperDrive run will kick off.

To create the estimator, define the following:

  • The directory that contains your scripts needed for training (source_directory). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
  • The training script that will be executed (entry_script).
  • The compute target (compute_target), in this case the AmlCompute cluster you created earlier.
  • Any environment dependencies required for training. For full control over your training environment (instead of using the defaults), you can create a custom Docker image to use for your remote run, which is what we’ve done in this example. The Docker image includes the necessary packages for Keras GPU training. The Dockerfile used to build the image is included in the hyperparameter-tune-with-keras/ folder for reference. See the r_environment() reference for the full set of configurable options.

env <- r_environment("keras-env", custom_docker_image = "amlsamples/r-keras:latest")

est <- estimator(source_directory = "hyperparameter-tune-with-keras",
                 entry_script = "cifar10_cnn.R",
                 compute_target = compute_target,
                 environment = env)

Configure the HyperDrive run

To kick off hyperparameter tuning in Azure ML, you will need to configure a HyperDrive run, which will in turn launch individual child runs of the training script with the corresponding hyperparameter values.

Define search space

In this experiment, we will tune four hyperparameters: batch size, number of epochs, learning rate, and decay. To begin tuning, we must define the range of values to explore for each hyperparameter and how those values are distributed. This is called a parameter space definition and can be created with discrete or continuous ranges.

Discrete hyperparameters are specified as a choice among discrete values represented as a list.

Advanced discrete hyperparameters can also be specified using a distribution. The following distributions are supported:

  • quniform(low, high, q)
  • qloguniform(low, high, q)
  • qnormal(mu, sigma, q)
  • qlognormal(mu, sigma, q)

Continuous hyperparameters are specified as a distribution over a continuous range of values. The following distributions are supported:

  • uniform(low, high)
  • loguniform(low, high)
  • normal(mu, sigma)
  • lognormal(mu, sigma)
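
To make two of these distributions concrete: the Azure ML documentation describes quniform(low, high, q) as returning a value like round(uniform(low, high) / q) * q, and loguniform(low, high) as exp(uniform(low, high)), so its bounds are given on the log scale. The sketch below mimics both with base R's runif(). The local_quniform() and local_loguniform() helpers are hypothetical stand-ins that run locally; they are not azuremlsdk functions and not HyperDrive's actual sampler.

```r
set.seed(42)  # make the illustration reproducible

# Hypothetical local stand-ins; these are not azuremlsdk functions.
local_quniform <- function(n, low, high, q) {
  # Draw uniformly, then snap each draw to the nearest multiple of q.
  round(runif(n, low, high) / q) * q
}
local_loguniform <- function(n, low, high) {
  # low and high are bounds on the logarithm of the returned value.
  exp(runif(n, low, high))
}

q_draws <- local_quniform(5, low = 16, high = 128, q = 16)
log_draws <- local_loguniform(5, low = log(1e-5), high = log(1e-2))

print(q_draws)    # multiples of 16 between 16 and 128
print(log_draws)  # values between 1e-5 and 1e-2
```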

Here, we will use the random_parameter_sampling() function to define the search space for each hyperparameter. batch_size and epochs will be chosen from discrete sets, while lr and decay will be drawn from continuous distributions. To hold a script parameter fixed, leave it out of the sampling list; it is then excluded from tuning and keeps the value set in the estimator step (or the training script’s own default, if the script defines one).

The other available sampling functions are grid_parameter_sampling() and bayesian_parameter_sampling(). This tutorial uses random sampling:

sampling <- random_parameter_sampling(list(batch_size = choice(c(16, 32, 64)),
                                           epochs = choice(c(200, 350, 500)),
                                           lr = normal(0.0001, 0.005),
                                           decay = uniform(1e-6, 3e-6)))
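
Before submitting, it can be worth sanity-checking a search space locally. A normal(mu, sigma) distribution is unbounded, so with mu = 0.0001 and sigma = 0.005 just under half of the draws for lr would fall below zero. The sketch below uses base R's pnorm() and rnorm() as stand-ins for HyperDrive's sampler, which is an assumption; how the training script treats a negative learning rate is not covered here.

```r
set.seed(1)  # make the simulation reproducible

mu <- 0.0001
sigma <- 0.005

# Theoretical share of draws below zero for a normal(mu, sigma) distribution.
p_negative <- pnorm(0, mean = mu, sd = sigma)

# Empirical check with simulated draws.
draws <- rnorm(10000, mean = mu, sd = sigma)
share_negative <- mean(draws < 0)

cat(sprintf("theoretical: %.3f, simulated: %.3f\n", p_negative, share_negative))
```

If negative values are not meaningful for a hyperparameter, a bounded distribution such as uniform(low, high) or loguniform(low, high) is one way to keep samples in range.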

Define termination policy

To prevent resource waste, Azure ML can detect and terminate poorly performing runs. HyperDrive will do this automatically if you specify an early termination policy.

Here, you will use bandit_policy(), which terminates any run whose primary metric is not within the specified slack factor of the best-performing training run.

policy <- bandit_policy(slack_factor = 0.15)
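
As a worked example of the slack factor, the Azure ML documentation describes a maximized metric: if the best run reports 0.8 at some evaluation interval and the slack factor is 0.2, runs whose best metric at that interval is below 0.8 / (1 + 0.2), roughly 0.66, are terminated. The sketch below reproduces that arithmetic. bandit_cutoff() is a hypothetical helper, not part of azuremlsdk, and the run values are made up; the example covers only the maximizing case from the documentation, not the minimized Loss used in this tutorial.

```r
# Hypothetical helper: cutoff below which a run would be terminated
# when the primary metric is being maximized.
bandit_cutoff <- function(best_metric, slack_factor) {
  best_metric / (1 + slack_factor)
}

cutoff <- bandit_cutoff(best_metric = 0.8, slack_factor = 0.2)

# Metrics reported by three illustrative runs at the same interval.
run_metrics <- c(run_a = 0.80, run_b = 0.70, run_c = 0.60)
terminated <- names(run_metrics)[run_metrics < cutoff]

print(round(cutoff, 3))
print(terminated)
```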

Other termination policy options are median_stopping_policy() and truncation_selection_policy().

If no policy is provided, all runs will continue to completion regardless of performance.

Finalize configuration

Now, you can create a HyperDriveConfig object to define your HyperDrive run. Along with the sampling and policy definitions, you need to specify the name of the primary metric that you want to track and whether you want to maximize or minimize it. The primary_metric_name must match the name of the primary metric you logged in your training script. max_total_runs specifies the total number of child runs to launch. See the hyperdrive_config() reference for the full set of configurable parameters.

hyperdrive_config <- hyperdrive_config(hyperparameter_sampling = sampling,
                                       primary_metric_goal = primary_metric_goal("MINIMIZE"),
                                       primary_metric_name = "Loss",
                                       max_total_runs = 8,
                                       policy = policy,
                                       estimator = est)

Submit the HyperDrive run

Finally, submit the experiment to run on your cluster. The parent HyperDrive run will launch the individual child runs. submit_experiment() will return a HyperDriveRun object that you will use to interface with the run. In this tutorial, since the cluster we created scales to a maximum of 4 nodes, up to 4 of the 8 child runs will run in parallel at any one time.

hyperdrive_run <- submit_experiment(exp, hyperdrive_config)
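
The scheduling arithmetic behind this can be checked with a few lines of base R. The numbers come from max_total_runs in hyperdrive_config() and max_nodes in create_aml_compute() above; the sketch assumes each child run occupies exactly one node, and it ignores early termination, which can free nodes sooner.

```r
max_total_runs <- 8  # from hyperdrive_config() above
max_nodes <- 4       # from create_aml_compute() above

# Assumption: each child run occupies exactly one node.
concurrent_runs <- min(max_total_runs, max_nodes)
batches <- ceiling(max_total_runs / concurrent_runs)

cat(concurrent_runs, "runs at a time, at least", batches, "batches\n")
```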

You can view the HyperDrive run’s details as a table. Clicking the “Web View” link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.

plot_run_details(hyperdrive_run)

Wait until hyperparameter tuning is complete before you run more code.

wait_for_run_completion(hyperdrive_run, show_output = TRUE)

Analyse runs by performance

Finally, you can view and compare the metrics collected during all of the child runs!

# Get the metrics of all the child runs
child_run_metrics <- get_child_run_metrics(hyperdrive_run)
child_run_metrics

# Get the child run objects sorted by their best primary metric value (best run first)
child_runs <- get_child_runs_sorted_by_primary_metric(hyperdrive_run)
child_runs

# Directly get the run object of the best performing run
best_run <- get_best_run_by_primary_metric(hyperdrive_run)

# Get the metrics of the best performing run
metrics <- get_run_metrics(best_run)
metrics

The metrics variable will include the values of the hyperparameters that resulted in the best performing run.
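
To pull out just the tuned hyperparameters, you can subset the metrics by name. The sketch below uses a mock list in place of the get_run_metrics() result. The assumed structure (a named list with one entry per logged metric) and the values are illustrative only, not output from a real run.

```r
# Mock stand-in for get_run_metrics(best_run); the values are made up.
mock_metrics <- list(batch_size = 32, epochs = 200, lr = 0.0012,
                     decay = 2e-6, Loss = 0.85)

# Names logged by the training script via log_metric_to_run().
hyperparameter_names <- c("batch_size", "epochs", "lr", "decay")

# Keep only the hyperparameters, dropping the primary metric.
best_hyperparameters <- mock_metrics[hyperparameter_names]
str(best_hyperparameters)
```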

Clean up resources

Delete the resources once you no longer need them. Don’t delete any resource you still plan to use.

Delete the compute cluster:

delete_compute(compute_target)