Restart behavior of Guild steps (pipelines)

garrett · March 13, 2023, 8:42pm

Summary

We propose the following changes to the steps support in Guild:

1. When restarting a pipeline run, Guild should not start new step runs. The existing step runs should be restarted in place. This should be the equivalent of running guild run --restart <step run ID>
2. When restarting a pipeline, Guild requires an additional --restart-all or --restart-failed option to clarify the user's intent

Rationale for 2:

Restarting a pipeline where all steps have completed warrants additional confirmation by the user due to the costly nature of a pipeline and the effects of terminating a pipeline run that was accidentally restarted.
Restarting a pipeline with failed runs could mean “restart all” or “restart failed” — we want the user to make this explicit.
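
To make the two options concrete, here is a minimal sketch of how the choice could select steps. This is not Guild code: `RestartMode`, `steps_to_restart`, and the step names are hypothetical, and treating every step that is not "completed" as part of "restart failed" is an assumption on my part.

```python
from enum import Enum


class RestartMode(Enum):
    """Hypothetical mirror of the proposed --restart-all / --restart-failed options."""

    ALL = "restart-all"
    FAILED = "restart-failed"


def steps_to_restart(step_status, mode):
    """Return the names of the steps to restart in place, in pipeline order.

    step_status maps a step name to its run status. Assumption: any step
    whose status is not "completed" counts as needing a restart.
    """
    if mode is None:
        # The proposal requires the user to state intent explicitly.
        raise ValueError(
            "restarting a pipeline requires --restart-all or --restart-failed"
        )
    if mode is RestartMode.ALL:
        return list(step_status)
    return [name for name, status in step_status.items() if status != "completed"]


# Example pipeline where the second step failed and the third never ran.
status = {"prepare": "completed", "train": "error", "evaluate": "pending"}
failed_only = steps_to_restart(status, RestartMode.FAILED)
everything = steps_to_restart(status, RestartMode.ALL)
```

With no mode given the function raises instead of guessing, which is the behavior point 2 asks for.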

This is a breaking change. Note, however, that the current implementation for steps is not useful.

This proposal is under development

See above (TODO move details here).

TODO: Maybe outline different spellings above??

Changes to guild.steps_main and guild.steps_util
Guild should avoid creating new runs on restart. This might mean it initializes steps up front on create and reuses those linked dirs on restart (the problem of applying flag changes to these runs still exists, but this would avoid the new step issue we see now)
Changes to steps in the Guild file are not applied to the run being restarted (how to handle new step defs??)
Maybe address Guild’s poor handling of operation flag config, i.e. the lack of pass-through ability
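
A rough sketch of the "initialize up front, reuse on restart" idea follows. It is an illustration only, not the layout of guild.steps_main or guild.steps_util: `init_pipeline` and `restart_pipeline` are hypothetical names, run IDs are plain UUID hex strings, and skipping steps that were added to the Guild file after creation is just one possible answer to the open question above.

```python
import uuid


def init_pipeline(step_names):
    """Assign one run ID per step when the pipeline run is first created."""
    return {name: uuid.uuid4().hex for name in step_names}


def restart_pipeline(step_runs, step_names):
    """Return (step name, run ID) pairs to restart, reusing the existing IDs.

    Assumption: steps that were not present when the pipeline run was
    created are skipped, so no new step runs appear on restart.
    """
    return [(name, step_runs[name]) for name in step_names if name in step_runs]


# Create the pipeline with two steps, then restart after a third was defined.
runs = init_pipeline(["prepare", "train"])
restarted = restart_pipeline(runs, ["prepare", "train", "evaluate"])
```

Each returned run ID is what would be passed to guild run --restart <step run ID>, so a restart touches only runs that already exist.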
