Guild steps and pipeline - reuse same run

Is there a way to have a pipeline reuse the same run?

An example part of my guild file:

- model: segmentation
  operations:
    train_and_convert_pipeline:
      steps:
        - train_model batch_size=${batch_size} dryrun=${dryrun} num_epochs=${num_epochs} sample_ids=${sample_ids}
        - utils:convert_to_onnx
      flags:
        $include: train-flags

This will produce a run for the train_model op and a run for the utils:convert_to_onnx op. Is there a way to have utils:convert_to_onnx reuse the same run as train_model? The utils:convert_to_onnx op essentially just saves an additional file, so it would be nice not to have to keep track of an entire separate run for this.

I don’t think I understand “reuse a run”.

Guild runs use files as inputs. Runs generate files and those files can be used by subsequent runs. So any files generated by train_model can be made available to convert_to_onnx.

You can read about this in Dependencies.
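For example, a convert_to_onnx operation could declare the train_model run as a dependency so the trained model file is resolved into its own run directory. This is just a sketch: the file name model.pt is a placeholder for whatever train_model actually writes, and both ops are shown under one model here to keep it short:

- model: segmentation
  operations:
    train_model: {}            # writes model.pt (placeholder name)
    convert_to_onnx:
      requires:
        - op: train_model      # resolve files from a train_model run
          select: model.pt     # trained model file to make available here

Guild resolves this from the latest suitable train_model run by default; the convert_to_onnx run then sees model.pt in its own run directory and can write the ONNX file next to it.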

But you may be asking about something else so I’m not sure my answer here is helpful.

Let me try to rephrase. The above pipeline will create two Guild runs, one for the train_model op and one for the utils:convert_to_onnx op. The utils:convert_to_onnx op will use the train_model run as input, and that all works as expected.

I see the utils:convert_to_onnx as a ‘patch’ to the train_model run, i.e. in my mental model they should be the same run/experiment and not two separate runs.

Consider, for example, a situation where I want to share the ONNX model produced by utils:convert_to_onnx with some team members. In this case I have to share both runs in order for them to have all the info about what generated the ONNX model. Or at least that's how I currently understand it.

Did that clarify?

I understand now — thank you for the clarification!

Guild is not really set up to do something like this. In Guild, a run, once completed, is informally considered read-only. Guild does not currently enforce this read-only state, but I think it should. The thinking is that, once a run is completed, it's set and should not later be changed. Future releases of Guild will likely formally support this via these mechanisms:

  • Set read-only file status for the run directory and run files
  • Generate a digest for the read-only run
  • Support checking a run against the digest to detect changes

These are all important considerations for reproducibility and auditability.

However, the patching scenario that you describe is quite common — and generation of a runnable artifact is a good example. Another example might be model compression, quantization, etc.

From Guild’s point of view, these patch operations should be separate runs. This keeps the upstream runs immutable and separates any newly generated artifacts. If the downstream operation is meant to modify an upstream file, it should use a copy dependency and modify its own copy of the upstream file.

upstream: {}     # generates some file foo.txt
downstream:      # compresses foo.txt
  requires:
    - op: upstream
      select: foo.txt
      target-type: copy

In Guild 0.7.x the default target type is link. To copy you need to explicitly use the copy target type as per the example above. This will change in 0.8 so that copy is the default. If you want to link, you’ll need to use link. In that case, the link will be read-only — again, using the rationale above.

Now, all this said, Guild does support a --restart option, which is specifically designed to re-run an operation in an existing run directory. This is really intended for runs in terminated or error status but works just as well with completed status. The use case this addresses is the common one of restarting a run that stopped early or failed, e.g. to train more or to fix a bug, without having to restart the run from scratch.

For your case, I would first consider the Guild approach I describe above, where patches are really just additional runs. Think of this like a copy-on-write file system, where changes are implemented as additional transformations rather than in-place edits. Docker images, for example, work this way.

If you strongly prefer to edit the run files in place, you still need a second run. You can link to the files that you want to modify and then delete the patch run afterward. However, I think this is not ideal. The patch is a meaningful operation, which I think you should record. The second run formally captures the patch operation, including the source code used, flags, results, etc. If you delete this run, you lose that record.
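If you do go that route, a minimal sketch of the patch operation would link to the file rather than copy it. The operation and file names below are placeholders, and this assumes the patch edits the file in place rather than replacing it:

patch_model:                   # hypothetical patch op
  requires:
    - op: train_model
      select: model.pt         # placeholder for the file you want to modify
      target-type: link        # link rather than copy, so edits reach the upstream run

Because the dependency is a link, whatever the operation writes through it lands in the train_model run directory, and you can delete the patch run afterward, e.g. with guild runs rm. But again, deleting that run also deletes the record of how the patch was made.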