Select the best model checkpoint for downstream runs

Overview

You just trained your model over several epochs, saving checkpoints along the way. It’s tempting to use the latest checkpoint, but the latest isn’t always the most accurate. Given a set of checkpoints, how does a downstream run select the right one?

You can use this technique:

  1. Save files using names that include the numeric selection criteria.
  2. Use select-min or select-max with an operation dependency to select the file by that criterion.

Include metrics in generated file names

If your selection criterion for a file is “validation accuracy”, save applicable files with the numeric validation accuracy in the name.

For example, the Keras checkpoint callback saves checkpoints during training. Here’s an example that saves model weights with a file name that includes the applicable epoch and corresponding validation accuracy:
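A minimal sketch of the naming scheme, assuming TensorFlow’s Keras API. The callback configuration appears in the comments; the plain-Python code below it emulates the file names Keras generates from the template, using hypothetical accuracy values:

```python
# In Keras, the callback would look roughly like this (a sketch, not
# verified against a specific TensorFlow version):
#
#   keras.callbacks.ModelCheckpoint(
#       filepath="weights.{epoch:02d}-{val_accuracy:.4f}.hdf5",
#       save_weights_only=True,
#   )
#
# Emulate the file names the template produces for each epoch.
TEMPLATE = "weights.{epoch:02d}-{val_accuracy:.4f}.hdf5"

# Hypothetical per-epoch validation accuracies.
val_accuracies = [0.9647, 0.9738, 0.9775, 0.9813, 0.9805]

filenames = [
    TEMPLATE.format(epoch=i + 1, val_accuracy=acc)
    for i, acc in enumerate(val_accuracies)
]
print(filenames[0])  # weights.01-0.9647.hdf5
```

Keras fills in the epoch number and the logged `val_accuracy` metric at save time; the result is one file per epoch with the metric embedded in its name.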

When you train over five epochs, you end up with five files:

<run dir>
weights.01-0.9647.hdf5
weights.02-0.9738.hdf5
weights.03-0.9775.hdf5
weights.04-0.9813.hdf5
weights.05-0.9805.hdf5

In this case the most “accurate” set of weights comes from epoch 4, not epoch 5. While this is a contrived example, it’s common for validation accuracy to decline after a certain point in a training run.

Select files with min and max patterns

You can select files from an upstream run using the select-min and select-max resource source attributes. These apply a pattern to file names and select the single file whose captured value is the minimum or maximum.

Here’s a downstream operation test that uses select-max to select the weights file with the highest accuracy.

test:
  requires:
    - operation: train
      select-max: weights\.\d+-0\.(\d+)\.hdf5
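The idea behind the pattern is straightforward: match each file name against the regular expression, read the first capture group as a number, and keep the file with the largest value. A rough emulation in Python (a sketch of the concept, not Guild’s actual implementation):

```python
import re

# The select-max pattern from the Guild file, with one capture group
# for the accuracy digits.
PATTERN = r"weights\.\d+-0\.(\d+)\.hdf5"

files = [
    "weights.01-0.9647.hdf5",
    "weights.02-0.9738.hdf5",
    "weights.03-0.9775.hdf5",
    "weights.04-0.9813.hdf5",
    "weights.05-0.9805.hdf5",
]

def select_max(pattern, names):
    """Return the name whose first capture group has the largest value."""
    scored = []
    for name in names:
        m = re.fullmatch(pattern, name)
        if m:
            scored.append((float(m.group(1)), name))
    return max(scored)[1]

print(select_max(PATTERN, files))  # weights.04-0.9813.hdf5
```

select-min works the same way with `min` in place of `max`. Files that don’t match the pattern are simply ignored.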

If you want to use a consistent name for the selected file, use rename.

test:
  requires:
    - operation: train
      select-max: weights\.\d+-0\.(\d+)\.hdf5
      rename: weights.+\.hdf5 weights.hdf5

Note that when a file is renamed, the link to the original generated file is maintained, so you can still resolve the source.
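The rename spec takes a match pattern and a replacement name; conceptually it behaves like a regular-expression substitution on the selected file name (a sketch, not Guild’s implementation):

```python
import re

# Emulate the rename spec "weights.+\.hdf5 weights.hdf5": a selected
# file matching the first pattern is linked under the second, fixed name.
selected = "weights.04-0.9813.hdf5"  # hypothetical select-max result
renamed = re.sub(r"weights.+\.hdf5", "weights.hdf5", selected)
print(renamed)  # weights.hdf5
```

Whichever checkpoint wins the select-max comparison, the downstream run sees it under the stable name weights.hdf5.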

As a best practice, leave names unchanged and modify your test script to discover target files by inspecting the run directory. Guild does not support passing selected files as flags.
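For example, rather than hard-coding a file name, a test script can discover the resolved weights file at run time. A sketch, assuming a single .hdf5 file is resolved into the run directory (the helper name and extension are illustrative):

```python
from pathlib import Path

def find_weights(run_dir="."):
    """Return the single resolved weights file in the run directory.

    Assumes the dependency resolved exactly one *.hdf5 file.
    """
    candidates = sorted(Path(run_dir).glob("*.hdf5"))
    if len(candidates) != 1:
        raise RuntimeError(
            "expected exactly one weights file, found %s" % candidates)
    return candidates[0]
```

This works whether or not the file was renamed, since the script keys on the extension rather than a specific name.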

Rationale

Selecting the “best” file for an operation is hard if you don’t provide some information about the file. You could resolve all upstream files and rely on the downstream operation to select the best one. That’s a reasonable strategy, but it places the burden on the downstream operation. If you know the selection criterion ahead of time, it’s more efficient to encode its value in the generated file names. Guild can then select the best file by applying a max or min filter using a file name pattern.