Restart behavior of Guild steps (pipelines)

Summary

We propose the following changes to the steps support in Guild:

  1. When a pipeline run is restarted, Guild should not start new step runs but should restart the existing step runs in place. This should be the equivalent of running guild run --restart <step run ID>.
  2. When restarting a pipeline, Guild should require an additional --restart-all or --restart-failed option to clarify the user's intent.

Rationale for 2:

  • Restarting a pipeline whose steps have all completed warrants additional confirmation by the user, both because pipelines are costly to run and because of the effects of terminating a pipeline run that was restarted by accident.
  • Restarting a pipeline with failed runs could mean “restart all” or “restart failed” — we want the user to make this explicit.
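The selection rule behind the two proposed options can be sketched as follows. This is a minimal illustration only, not Guild's implementation: RestartMode, RestartModeRequired and steps_to_restart are hypothetical names, and treating every status other than "completed" as failed is an assumption.

```python
from enum import Enum
from typing import Dict, List, Optional


class RestartMode(Enum):
    ALL = "all"        # stands in for the proposed --restart-all
    FAILED = "failed"  # stands in for the proposed --restart-failed


class RestartModeRequired(Exception):
    """Raised when a pipeline restart does not state the user's intent."""


def steps_to_restart(
    step_status: Dict[str, str], mode: Optional[RestartMode]
) -> List[str]:
    """Return the IDs of the step runs to restart in place.

    step_status maps a step run ID to a status string. Treating any
    status other than "completed" as failed is an assumption.
    """
    if mode is None:
        # No implicit default: the user must choose explicitly.
        raise RestartModeRequired(
            "restarting a pipeline requires --restart-all or --restart-failed"
        )
    if mode is RestartMode.ALL:
        return list(step_status)
    return [
        run_id for run_id, status in step_status.items()
        if status != "completed"
    ]
```

Raising an error when no mode is given, rather than falling back to a default, mirrors the rationale above: an accidental restart of a completed pipeline is refused instead of silently re-running every step.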

This is a breaking change. Note, however, that the current implementation of restart for steps is not useful.

This proposal is under development.

Problem

See above (TODO move details here).

Proposed Approach

Alternative Approaches

Do nothing

Alternative options and default values

TODO: Maybe outline different spellings above??

Implementation notes

  • Changes to guild.steps_main and guild.steps_util
  • Guild should avoid creating new runs on restart. This might mean it initializes steps up front on create and reuses those linked dirs on restart (the problem of applying flag changes to these runs still exists, but this would avoid the new step issue we see now)
  • Changes to steps in the Guild file are not applied to the run being restarted (how to handle new step defs??)
  • Maybe address Guild’s poor handling of operation flag config, i.e. the lack of pass-through ability
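
The note above about initializing steps up front and reusing the linked dirs on restart could look roughly like the sketch below. It is an illustration under stated assumptions, not code from guild.steps_main or guild.steps_util: init_step_run is a hypothetical helper, and the layout (one symlink per step inside the pipeline run dir, pointing at the step's run dir) is assumed.

```python
import os
import uuid


def init_step_run(pipeline_dir: str, runs_dir: str, step_name: str) -> str:
    """Return the run dir for a step, creating it only if no link exists.

    Hypothetical helper: assumes the pipeline run dir holds one symlink
    per step, named after the step and pointing at the step's run dir.
    """
    link = os.path.join(pipeline_dir, step_name)
    if os.path.islink(link):
        # Restart: reuse the run dir the existing link points to.
        return os.path.realpath(link)
    # First run: create a new step run dir and link it from the pipeline.
    run_dir = os.path.join(runs_dir, uuid.uuid4().hex)
    os.makedirs(run_dir)
    os.symlink(run_dir, link)
    return os.path.realpath(link)
```

Calling the helper a second time for the same step returns the same directory, so a restart would operate on the existing step run rather than creating a new one. How flag changes and new step definitions are applied to these runs is left open here, as it is in the notes above.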