This example shows how to use Guild to track experiments and optimize a TensorFlow model. It highlights the use of TensorBoard and the HParams plugin to evaluate hyperparameters and find optimal values. It uses unmodified code from the official example in TensorFlow Overview.
| File | Description |
|------|-------------|
| guild.yml | Project Guild file |
| beginner_with_flags.py | Sample code modified to expose flags |
| requirements.txt | List of required libraries |
This example follows the process outlined in Use Guild in a Project.
Create Virtual Environment
Activate the environment:
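For example, using Python's built-in venv module (a sketch assuming a POSIX shell; the environment name `venv` is arbitrary):

```shell
# Create a virtual environment in ./venv and activate it
python3 -m venv venv
. venv/bin/activate
```

Then install the example's dependencies with `pip install -r requirements.txt`.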
Run the Sample Script with Python
Before adding Guild support, verify that you can run the beginner example without errors (e.g. with `python beginner.py`).
The command should run to completion after training a model over 5 epochs. If you see errors, resolve them before continuing. If you need help, let us know.
Run the Sample Script with Guild
Run beginner.py with Guild:
guild run beginner.py
Guild runs the script to generate a run. When the operation is finished, show the run info:
guild runs info
By default, `guild runs info` shows information for the latest run.
Note the model loss reflected in the result.
See Runs for commands you can use with runs.
Highlight: Guild lets you run and track experiments with zero code changes.
The following script parameters should be exposed as flags:
- Training epochs
- Learning rate
- Dropout
- Hidden layer activation
We modify the script to use global variables to define these values.
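The pattern can be sketched as follows. This is not the full script: the TensorFlow model and training code is omitted, and the `train` helper is illustrative; the names and defaults mirror the flags shown in the project help later in this example.

```python
# beginner_with_flags.py (sketch): hyperparameters exposed as module-level
# globals so Guild can detect them as flags and override them per run.
epochs = 2
dropout = 0.1
learning_rate = 0.002
activation = "relu"

def train():
    # The real script builds and fits a tf.keras model using these values;
    # here we only show that the globals drive the run configuration.
    return {
        "epochs": epochs,
        "dropout": dropout,
        "learning_rate": learning_rate,
        "activation": activation,
    }
```

Because the hyperparameters are plain global assignments, Guild can rewrite them for each run without any Guild-specific imports in the script.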
With this simple change, you can use Guild to run experiments with different hyperparameters. Each run is recorded with the applicable set of flag values.
guild run beginner_with_flags.py epochs=10
Use Guild to search for optimal hyperparameters. By default, Guild tries to minimize the loss scalar, which the sample script happens to log. If an operation logs something else, specify the scalar to optimize using the `--minimize` or `--maximize` option.
Start a run to find optimal values for learning_rate and dropout. Train over two epochs to save time.
guild run beginner_with_flags.py --optimize \
  epochs=2 \
  dropout=range[0.1:0.9:0.1] \
  learning_rate=loguniform[1e-4:1e-1]
For more information about this command, see Hyperparameter Optimization.
By default, Guild runs 20 trials. Specify a different value using the `--max-trials` option.
Use `guild runs` to list the runs:
By default, Guild shows the latest 20 runs. To show all runs, use the `--all` option.
Use TensorBoard to compare runs:

guild tensorboard
Click HPARAMS and select PARALLEL COORDINATES VIEW. Select Logarithmic for learning_rate and Quantile for accuracy, loss, and time. This is a useful view for evaluating hyperparameters.

Note runs with high accuracy and short run times; these are the "best" runs. To highlight them, click and drag along a vertical axis to select a region, adjusting the region as needed. TensorBoard highlights runs that fall within the selected range.
With these results, we can make some observations:
- This model learns quickly on the data set. We achieve solid performance with only two epochs.
- Optimal dropout appears to be around 10% at least over the short training period. We could experiment with higher dropout rates over longer runs.
- Optimal learning rate appears to fall between 0.001 and 0.01. This is with two epochs. We can expect the optimal value to change as we increase training.
- We can match the default performance from the Google example with just two training epochs. This reduces our time and energy cost by 60%.
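The 60% figure follows from the epoch counts alone, assuming training cost scales roughly linearly with epochs:

```python
# Savings from training 2 epochs instead of the default 5
default_epochs = 5
reduced_epochs = 2
savings = 1 - reduced_epochs / default_epochs
print(f"{savings:.0%}")  # prints "60%"
```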
You can expect breakthrough observations with other models. This is the value of experiment tracking.
With a baseline to compare against, you might explore these questions:
- Can we improve the performance of the model with more training? We can test this by increasing epochs with the current optimal values for dropout and learning rate. We can run more optimization trials to see if the optimal values hold.
- Can we improve validation accuracy with more data augmentation?
- Do we need dropout? We didn’t explore 0% but we should.
- Do higher levels of dropout show improved results with more training?
Experiments prompt questions, which prompt more experiments.
Add a Guild File
The operation runs the beginner_with_flags Python module. It provides a description and default flag values.
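A Guild file matching this description might look like the following (a sketch using Guild's operation-only file format; the description and flag defaults are taken from the project help output shown below):

```yaml
# guild.yml (sketch) - defines the train operation
train:
  description: Train a simple neural network to classify MNIST digits
  main: beginner_with_flags
  flags:
    epochs: 2
    dropout: 0.1
    learning_rate: 0.002
    activation: relu
```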
List the project operations:
train Train a simple neural network to classify MNIST digits
Show help for the project:
OVERVIEW

    You are viewing help for operations defined in the current directory.

    To run an operation use 'guild run OPERATION' where OPERATION is one
    of options listed below. If an operation is associated with a model,
    include the model name as MODEL:OPERATION.

    To list available operations, run 'guild operations'.

    Set operation flags using 'FLAG=VALUE' arguments to the run command.
    Refer to the operations below for a list of supported flags.

    For more information on running operations, try 'guild run --help'.
    For general information, try 'guild --help'.

BASE OPERATIONS

    train
        Train a simple neural network to classify MNIST digits

        Flags:
            activation     (default is relu)
            dropout        (default is 0.1)
            epochs         (default is 2)
            learning_rate  (default is 0.002)
Guild files document project capabilities, as well as enable them.
Run the operation:
Guild trains the model using the optimal hyperparameter values. Compare the results to earlier runs:

guild compare
Use the arrow keys to navigate the list. Move to the accuracy column. The accuracy of the latest run (the run at the top of the listing) should rank among the best results.
In this example you train a standard TensorFlow example. The original code remains essentially unchanged. You improve the code with variables that define otherwise hard-coded hyperparameters. You don’t import or use Guild modules. Instead you augment the project with a Guild file. This is all you need to enable a host of features.
For a more detailed step-by-step tutorial, see Get Started with Guild AI. If you're already familiar with core Guild features (you learned a lot already in this example), skip to Use Guild in a Project for help applying Guild to your work.