Summary
This proposal addresses problems with Guild’s default select rules for required operations (runs). The rules as of 0.8.2 are too inclusive and end up selecting files that are not applicable to downstream runs.
This is a breaking change
This proposal is awaiting feedback
Problem
As of 0.8.2, Guild by default selects every file in the upstream run and links it into the downstream run.
Consider a prepare-data run, which contains these files:
~/runs/abcd1234:
data-in.csv
data-out.npy
prepare_data.py
train.py
This directory contains the run input data-in.csv, the generated output data-out.npy, and some source code files.
Per 0.8.2 rules, a train operation that requires prepare-data files would contain these files:
~/runs/abcd1234:
data-in.csv (link)
data-out.npy (link)
prepare_data.py (link)
train.py (link)
model.joblib
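The effect of the 0.8.2 default can be reproduced outside Guild with a small sketch. This is not Guild's implementation: link_all and demo are hypothetical helpers that link every upstream file into a downstream directory and then add the file the downstream run generates itself.

```python
import os
import tempfile


def link_all(upstream_dir, downstream_dir):
    """Symlink every file in upstream_dir into downstream_dir.

    Hypothetical stand-in for the select-everything default described above.
    """
    linked = []
    for name in sorted(os.listdir(upstream_dir)):
        src = os.path.join(upstream_dir, name)
        dest = os.path.join(downstream_dir, name)
        os.symlink(src, dest)
        linked.append(name)
    return linked


def demo():
    """Return a map of downstream file name to whether it is a link."""
    with tempfile.TemporaryDirectory() as root:
        upstream = os.path.join(root, "prepare-data")
        downstream = os.path.join(root, "train")
        os.makedirs(upstream)
        os.makedirs(downstream)
        # Files of the example prepare-data run.
        for name in ("data-in.csv", "data-out.npy", "prepare_data.py", "train.py"):
            open(os.path.join(upstream, name), "w").close()
        link_all(upstream, downstream)
        # File generated by the train run itself.
        open(os.path.join(downstream, "model.joblib"), "w").close()
        return {
            name: os.path.islink(os.path.join(downstream, name))
            for name in sorted(os.listdir(downstream))
        }


if __name__ == "__main__":
    for name, is_link in demo().items():
        print(name, "(link)" if is_link else "")
```

Of the four linked files, only data-out.npy is an output of prepare-data. The other three are inputs or source code that the train run does not need from upstream.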
To prevent this problem, the train operation needs to be defined like this:
# guild.yml
train:
  requires:
    - operation: prepare-data
      select: data-out.npy
Proposed Approach
Guild will select only run-generated files by default when resolving operation (run) dependencies.
Post 0.8.2, Guild logs run files using a manifest and so knows what files are generated by a run. These files can be listed by specifying the -g, --generated flag in guild ls.
This change BREAKS existing Guild behavior and may cause existing operations to fail.
Backward compatibility
Guild will support an environment variable GUILD_LEGACY_OPDEP_SELECT that, when set to 1, will cause Guild to select both generated and dependency upstream run files by default. This is not the same behavior as pre 0.8.2 because source code files are not selected. To select all upstream run files by default, set GUILD_LEGACY_OPDEP_SELECT to 2.
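The three default selection modes can be summarized in a short sketch. This is a model of the rules described in this proposal, not Guild's actual code. The manifest structure, the file-kind labels, and the function names are hypothetical, and classifying data-in.csv as a dependency file is an assumption.

```python
import os

# Hypothetical file kinds; Guild's real manifest format may differ.
GENERATED = "generated"
DEPENDENCY = "dependency"
SOURCECODE = "sourcecode"

# Illustrative manifest for the example prepare-data run.
# Treating data-in.csv as a dependency is an assumption.
PREPARE_DATA_MANIFEST = {
    "data-in.csv": DEPENDENCY,
    "data-out.npy": GENERATED,
    "prepare_data.py": SOURCECODE,
    "train.py": SOURCECODE,
}


def selected_kinds(legacy_mode):
    """Map a GUILD_LEGACY_OPDEP_SELECT value to the file kinds selected by default."""
    if legacy_mode == "2":
        return {GENERATED, DEPENDENCY, SOURCECODE}
    if legacy_mode == "1":
        return {GENERATED, DEPENDENCY}
    # Proposed default: generated files only.
    return {GENERATED}


def default_select(manifest, env=None):
    """Return the sorted upstream paths selected when no explicit select is given."""
    env = os.environ if env is None else env
    kinds = selected_kinds(env.get("GUILD_LEGACY_OPDEP_SELECT"))
    return sorted(path for path, kind in manifest.items() if kind in kinds)


if __name__ == "__main__":
    for mode in (None, "1", "2"):
        env = {} if mode is None else {"GUILD_LEGACY_OPDEP_SELECT": mode}
        print(mode, default_select(PREPARE_DATA_MANIFEST, env))
```

Under these assumptions, the proposed default selects only data-out.npy, mode 1 adds data-in.csv, and mode 2 selects all four files.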
Alternative Approaches
Do nothing
Guild can no longer select all files in an upstream run directory by default because, post 0.8.2, Guild copies operation source code to the run root.
While Guild could continue to include upstream run dependencies (i.e. files required by the upstream run), this is technically incorrect. If a downstream run requires files that are also required by the upstream run, it should define those explicitly.
Users who are negatively impacted by this backward-incompatible change can set GUILD_LEGACY_OPDEP_SELECT to 1 to restore Guild’s legacy selection of generated and dependency files.