--- title: "Hyperparameter tune a Keras model" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Hyperparameter tune a Keras model} %\VignetteEngine{knitr::rmarkdown} \use_package{UTF-8} --- This tutorial demonstrates how you can efficiently tune hyperparameters for a model using HyperDrive, Azure ML's hyperparameter tuning functionality. You will train a Keras model on the CIFAR10 dataset, automate hyperparameter exploration, launch parallel jobs, log your results, and find the best run. ### What are hyperparameters? Hyperparameters are variable parameters chosen to train a model. Learning rate, number of epochs, and batch size are all examples of hyperparameters. Using brute-force methods to find the optimal values for parameters can be time-consuming, and poor-performing runs can result in wasted money. To avoid this, HyperDrive automates hyperparameter exploration in a time-saving and cost-effective manner by launching several parallel runs with different configurations and finding the configuration that results in best performance on your primary metric. Let's get started with the example to see how it works! ## Prerequisites If you don’t have access to an Azure ML workspace, follow the [setup tutorial](https://azure.github.io/azureml-sdk-for-r/articles/configuration.html) to configure and create a workspace. ## Set up development environment The setup for your development work in this tutorial includes the following actions: * Import required packages * Connect to a workspace * Create an experiment to track your runs * Create a remote compute target to use for training ### Import **azuremlsdk** package ```{r eval=FALSE} library(azuremlsdk) ``` ### Load your workspace Instantiate a workspace object from your existing workspace. The following code will load the workspace details from a **config.json** file if you previously wrote one out with [`write_workspace_config()`](https://azure.github.io/azureml-sdk-for-r/reference/write_workspace_config.html). ```{r load_workpace, eval=FALSE} ws <- load_workspace_from_config() ``` Or, you can retrieve a workspace by directly specifying your workspace details: ```{r get_workpace, eval=FALSE} ws <- get_workspace("", "", "") ``` ### Create an experiment An Azure ML **experiment** tracks a grouping of runs, typically from the same training script. Create an experiment to track hyperparameter tuning runs for the Keras model. ```{r create_experiment, eval=FALSE} exp <- experiment(workspace = ws, name = 'hyperdrive-cifar10') ``` If you would like to track your runs in an existing experiment, simply specify that experiment's name to the `name` parameter of `experiment()`. ### Create a compute target By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace. You may need to wait a few minutes for your compute cluster to be provisioned if it doesn't already exist. 
### Create an experiment

An Azure ML **experiment** tracks a grouping of runs, typically from the same training script. Create an experiment to track the hyperparameter tuning runs for the Keras model.

```{r create_experiment, eval=FALSE}
exp <- experiment(workspace = ws, name = 'hyperdrive-cifar10')
```

If you would like to track your runs in an existing experiment, simply pass that experiment's name to the `name` parameter of `experiment()`.

### Create a compute target

By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace. If the cluster has to be created, you may need to wait a few minutes for it to be provisioned.

```{r create_cluster, eval=FALSE}
cluster_name <- "gpucluster"

compute_target <- get_compute(ws, cluster_name = cluster_name)
if (is.null(compute_target)) {
  vm_size <- "STANDARD_NC6"
  compute_target <- create_aml_compute(workspace = ws,
                                       cluster_name = cluster_name,
                                       vm_size = vm_size,
                                       max_nodes = 4)
  wait_for_provisioning_completion(compute_target, show_output = TRUE)
}
```

## Prepare the training script

A training script called `cifar10_cnn.R` has been provided for you in the `hyperparameter-tune-with-keras` folder.

In order to leverage HyperDrive, the training script for your model must log the relevant metrics during model training. When you configure the hyperparameter tuning run, you specify the primary metric to use for evaluating run performance. You must log this metric so that it is available to the hyperparameter tuning process.

In order to log the required metrics, you need to do the following **inside the training script**, as shown in the sketch after this list:

* Import the **azuremlsdk** package:

    ```
    library(azuremlsdk)
    ```

* Take the hyperparameters as command-line arguments to the script. This is necessary so that when HyperDrive carries out the hyperparameter sweep, it can run the training script with the different hyperparameter values defined by the search space.
* Use the [`log_metric_to_run()`](https://azure.github.io/azureml-sdk-for-r/reference/log_metric_to_run.html) function to log the hyperparameters and the primary metric:

    ```
    log_metric_to_run("batch_size", batch_size)
    ...
    log_metric_to_run("epochs", epochs)
    ...
    log_metric_to_run("lr", lr)
    ...
    log_metric_to_run("decay", decay)
    ...
    log_metric_to_run("Loss", results[[1]])
    ```
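As an illustration, the top of a training script like `cifar10_cnn.R` might parse the hyperparameters and log them as follows. This is a minimal sketch, assuming the **optparse** package is used for argument parsing; the argument names match the search space defined later in this tutorial.

```{r script_sketch, eval=FALSE}
library(azuremlsdk)
library(optparse)

# Define the hyperparameters as command-line options so HyperDrive can
# pass in different values on each child run
option_list <- list(
  make_option("--batch_size", type = "integer", default = 32),
  make_option("--epochs", type = "integer", default = 200),
  make_option("--lr", type = "double", default = 0.0001),
  make_option("--decay", type = "double", default = 1e-6)
)
opt <- parse_args(OptionParser(option_list = option_list))

# Log the hyperparameter values used for this run
log_metric_to_run("batch_size", opt$batch_size)
log_metric_to_run("epochs", opt$epochs)
log_metric_to_run("lr", opt$lr)
log_metric_to_run("decay", opt$decay)

# ... train the model, then log the primary metric, e.g.:
# log_metric_to_run("Loss", results[[1]])
```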
## Create an estimator

An Azure ML **estimator** encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs are executed as containerized jobs on the specified compute target. The estimator defines the configuration for each of the child runs that the parent HyperDrive run will kick off.

To create the estimator, define the following:

* The directory that contains the scripts needed for training (`source_directory`). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
* The training script that will be executed (`entry_script`).
* The compute target (`compute_target`), in this case the AmlCompute cluster you created earlier.
* Any environment dependencies required for training. For full control over your training environment (instead of using the defaults), you can create a custom Docker image to use for your remote run, which is what we've done in this example. The Docker image includes the necessary packages for Keras GPU training. The Dockerfile used to build the image is included in the `hyperparameter-tune-with-keras/` folder for reference. See the [`r_environment()`](https://azure.github.io/azureml-sdk-for-r/reference/r_environment.html) reference for the full set of configurable options.

```{r create_estimator, eval=FALSE}
env <- r_environment("keras-env",
                     custom_docker_image = "amlsamples/r-keras:latest")

est <- estimator(source_directory = "hyperparameter-tune-with-keras",
                 entry_script = "cifar10_cnn.R",
                 compute_target = compute_target,
                 environment = env)
```

## Configure the HyperDrive run

To kick off hyperparameter tuning in Azure ML, you will need to configure a HyperDrive run, which will in turn launch individual child runs of the training script with the corresponding hyperparameter values.

### Define search space

In this experiment, we will tune four hyperparameters: batch size, number of epochs, learning rate, and decay. Before tuning can begin, we must define the range of values to explore for each hyperparameter and how they will be sampled. This is called a parameter space definition, and it can be created with discrete or continuous ranges.

__Discrete hyperparameters__ are specified as a choice among discrete values represented as a list. Advanced discrete hyperparameters can also be specified using a distribution. The following distributions are supported:

* `quniform(low, high, q)`
* `qloguniform(low, high, q)`
* `qnormal(mu, sigma, q)`
* `qlognormal(mu, sigma, q)`

__Continuous hyperparameters__ are specified as a distribution over a continuous range of values. The following distributions are supported:

* `uniform(low, high)`
* `loguniform(low, high)`
* `normal(mu, sigma)`
* `lognormal(mu, sigma)`

Here, we will use the [`random_parameter_sampling()`](https://azure.github.io/azureml-sdk-for-r/reference/random_parameter_sampling.html) function to define the search space for each hyperparameter. `batch_size` and `epochs` will be chosen from discrete sets, while `lr` and `decay` will be drawn from continuous distributions. If you wish to fix a script parameter's value, simply remove it from the sampling list; it will then be excluded from tuning and kept at the value assigned to it in the estimator step.

Other available sampling function options are:

* [`grid_parameter_sampling()`](https://azure.github.io/azureml-sdk-for-r/reference/grid_parameter_sampling.html)
* [`bayesian_parameter_sampling()`](https://azure.github.io/azureml-sdk-for-r/reference/bayesian_parameter_sampling.html)

```{r search_space, eval=FALSE}
sampling <- random_parameter_sampling(list(batch_size = choice(c(16, 32, 64)),
                                           epochs = choice(c(200, 350, 500)),
                                           lr = normal(0.0001, 0.005),
                                           decay = uniform(1e-6, 3e-6)))
```

### Define termination policy

To prevent resource waste, Azure ML can detect and terminate poorly performing runs. HyperDrive will do this automatically if you specify an early termination policy. Here, you will use [`bandit_policy()`](https://azure.github.io/azureml-sdk-for-r/reference/bandit_policy.html), which terminates any run whose primary metric falls outside the specified slack factor with respect to the best-performing training run.

```{r termination_policy, eval=FALSE}
policy <- bandit_policy(slack_factor = 0.15)
```

Other termination policy options are:

* [`median_stopping_policy()`](https://azure.github.io/azureml-sdk-for-r/reference/median_stopping_policy.html)
* [`truncation_selection_policy()`](https://azure.github.io/azureml-sdk-for-r/reference/truncation_selection_policy.html)

If no policy is provided, all runs will continue to completion regardless of performance.
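For instance, a median stopping policy could be configured in place of the bandit policy above. This is a minimal sketch with illustrative parameter values:

```{r median_policy, eval=FALSE}
# Alternative: terminate runs whose best primary metric is worse than the
# median of the running averages across all runs, evaluating at every
# interval and delaying the first evaluation by 5 intervals
# (the values here are illustrative)
policy <- median_stopping_policy(evaluation_interval = 1L,
                                 delay_evaluation = 5L)
```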
### Finalize configuration

Now, you can create a `HyperDriveConfig` object to define your HyperDrive run. Along with the sampling and policy definitions, you need to specify the name of the primary metric that you want to track and whether you want to maximize it or minimize it. The `primary_metric_name` must correspond with the name of the primary metric you logged in your training script. `max_total_runs` specifies the total number of child runs to launch. See the [`hyperdrive_config()`](https://azure.github.io/azureml-sdk-for-r/reference/hyperdrive_config.html) reference for the full set of configurable parameters.

```{r create_config, eval=FALSE}
hyperdrive_config <- hyperdrive_config(hyperparameter_sampling = sampling,
                                       primary_metric_name = "Loss",
                                       primary_metric_goal = primary_metric_goal("MINIMIZE"),
                                       max_total_runs = 8,
                                       policy = policy,
                                       estimator = est)
```

## Submit the HyperDrive run

Finally, submit the experiment to run on your cluster. The parent HyperDrive run will launch the individual child runs. `submit_experiment()` will return a `HyperDriveRun` object that you will use to interface with the run. In this tutorial, since the cluster we created scales to a maximum of `4` nodes, all 4 child runs will be launched in parallel.

```{r submit_run, eval=FALSE}
hyperdrive_run <- submit_experiment(exp, hyperdrive_config)
```

You can view the HyperDrive run's details as a table. Clicking the "Web View" link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.

```{r eval=FALSE}
plot_run_details(hyperdrive_run)
```

Wait until hyperparameter tuning is complete before you run more code.

```{r eval=FALSE}
wait_for_run_completion(hyperdrive_run, show_output = TRUE)
```

## Analyse runs by performance

Finally, you can view and compare the metrics collected during all of the child runs!

```{r analyse_runs, eval=FALSE}
# Get the metrics of all the child runs
child_run_metrics <- get_child_run_metrics(hyperdrive_run)
child_run_metrics

# Get the child run objects sorted in descending order by the best primary metric
child_runs <- get_child_runs_sorted_by_primary_metric(hyperdrive_run)
child_runs

# Directly get the run object of the best-performing run
best_run <- get_best_run_by_primary_metric(hyperdrive_run)

# Get the metrics of the best-performing run
metrics <- get_run_metrics(best_run)
metrics
```

The `metrics` variable will include the values of the hyperparameters that resulted in the best-performing run.

## Clean up resources

Delete the resources once you no longer need them. Don't delete any resource you plan to still use.

Delete the compute cluster:

```{r delete_compute, eval=FALSE}
delete_compute(compute_target)
```