--- title: "Train a TensorFlow model" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Train a TensorFlow model} %\VignetteEngine{knitr::rmarkdown} \use_package{UTF-8} --- This tutorial demonstrates how run a TensorFlow job at scale using Azure ML. You will train a TensorFlow model to classify handwritten digits (MNIST) using a deep neural network (DNN) and log your results to the Azure ML service. ## Prerequisites If you don’t have access to an Azure ML workspace, follow the [setup tutorial](https://azure.github.io/azureml-sdk-for-r/articles/configuration.html) to configure and create a workspace. ## Set up development environment The setup for your development work in this tutorial includes the following actions: * Import required packages * Connect to a workspace * Create an experiment to track your runs * Create a remote compute target to use for training ### Import **azuremlsdk** package ```{r eval=FALSE} library(azuremlsdk) ``` ### Load your workspace Instantiate a workspace object from your existing workspace. The following code will load the workspace details from a **config.json** file if you previously wrote one out with [`write_workspace_config()`](https://azure.github.io/azureml-sdk-for-r/reference/write_workspace_config.html). ```{r load_workpace, eval=FALSE} ws <- load_workspace_from_config() ``` Or, you can retrieve a workspace by directly specifying your workspace details: ```{r get_workpace, eval=FALSE} ws <- get_workspace("", "", "") ``` ### Create an experiment An Azure ML **experiment** tracks a grouping of runs, typically from the same training script. Create an experiment to track the runs for training the TensorFlow model on the MNIST data. ```{r create_experiment, eval=FALSE} exp <- experiment(workspace = ws, name = "tf-mnist") ``` If you would like to track your runs in an existing experiment, simply specify that experiment's name to the `name` parameter of `experiment()`. ### Create a compute target By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn't already exist in your workspace. You may need to wait a few minutes for your compute cluster to be provisioned if it doesn't already exist. ```{r create_cluster, eval=FALSE} cluster_name <- "gpucluster" compute_target <- get_compute(ws, cluster_name = cluster_name) if (is.null(compute_target)) { vm_size <- "STANDARD_NC6" compute_target <- create_aml_compute(workspace = ws, cluster_name = cluster_name, vm_size = vm_size, max_nodes = 4) wait_for_provisioning_completion(compute_target, show_output = TRUE) } ``` ## Prepare the training script A training script called `tf_mnist.R` has been provided for you in the `train-with-tensorflow/` subfolder of this vignette. The Azure ML SDK provides a set of logging APIs for logging various metrics during training runs. These metrics are recorded and persisted in the experiment run record, and can be be accessed at any time or viewed in the run details page in [Azure Machine Learning studio](http://ml.azure.com/). In order to collect and upload run metrics, you need to do the following **inside the training script**: * Import the **azuremlsdk** package ``` library(azuremlsdk) ``` * Add the [`log_metric_to_run()`](https://azure.github.io/azureml-sdk-for-r/reference/log_metric_to_run.html) function to track our primary metric, "accuracy", for this experiment. If you have your own training script with several important metrics, simply create a logging call for each one within the script. ``` log_metric_to_run("accuracy", sess$run(accuracy, feed_dict = dict(x = mnist$test$images, y_ = mnist$test$labels))) ``` See the [reference](https://azure.github.io/azureml-sdk-for-r/reference/index.html#section-training-experimentation) for the full set of logging methods `log_*()` available from the R SDK. ## Create an estimator An Azure ML **estimator** encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs are run as containerized jobs on the specified compute target. To create the estimator, define the following: * The directory that contains your scripts needed for training (`source_directory`). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required. * The training script that will be executed (`entry_script`). * The compute target (`compute_target`), in this case the AmlCompute cluster you created earlier. * Any environment dependencies required for training. For full control over your training environment (instead of using the defaults), you can create a custom Docker image to use for your remote run, which is what we've done in this example. The Docker image includes the necessary packages for TensorFlow GPU training. The Dockerfile used to build the image is included in the `train-with-tensorflow/` folder for reference. See the [`r_environment()`](https://azure.github.io/azureml-sdk-for-r/reference/r_environment.html) reference for the full set of configurable options. Pass the environment object to the environment parameter in estimator. ```{r create_estimator, eval=FALSE} env <- r_environment("tensorflow-env", custom_docker_image = "amlsamples/r-tensorflow:latest") est <- estimator(source_directory = "train-with-tensorflow", entry_script = "tf_mnist.R", compute_target = compute_target, environment = env) ``` ## Submit the job Finally submit the job to run on your cluster. [`submit_experiment()`](https://azure.github.io/azureml-sdk-for-r/reference/submit_experiment.html) returns a `Run` object that you can then use to interface with the run. ```{r submit_job, eval=FALSE} run <- submit_experiment(exp, est) ``` You can view the run’s details as a table. Clicking the “Web View” link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI. ```{r eval=FALSE} plot_run_details(run) ``` Model training happens in the background. Wait until the model has finished training before you run more code. ```{r eval=FALSE} wait_for_run_completion(run, show_output = TRUE) ``` ## View run metrics Once your job has finished, you can view the metrics collected during your TensorFlow run. ```{r get_metrics, eval=FALSE} metrics <- get_run_metrics(run) metrics ``` ## Clean up resources Delete the resources once you no longer need them. Don't delete any resource you plan to still use. Delete the compute cluster: ```{r delete_compute, eval=FALSE} delete_compute(compute_target) ```