Change default select rules for operation dependencies

Summary

This proposal addresses problems with Guild’s default select rules for required operations (runs). The rules as of 0.8.2 are too inclusive and end up selecting files that are not applicable to downstream runs.

This is a breaking change

This proposal is awaiting feedback

Problem

As of 0.8.2, Guild by default selects every file in an upstream run and links it into the downstream run.

Consider a prepare-data run, which contains these files:

~/runs/abcd1234:
  data-in.csv
  data-out.npy
  prepare_data.py
  train.py

This directory contains the run input data-in.csv, the generated output data-out.npy, and some source code files.

Per 0.8.2 rules, a train run that requires prepare-data links every one of these files, including the upstream input and source code, alongside its own output model.joblib:

~/runs/efgh5678:
  data-in.csv (link)
  data-out.npy (link)
  prepare_data.py (link)
  train.py (link)
  model.joblib

To avoid linking the files it does not need, the train operation must currently be defined with an explicit select:

# guild.yml
train:
  requires:
    - operation: prepare-data
      select: data-out.npy

Proposed Approach

Guild will select only run-generated files by default when resolving operation (run) dependencies.

Post 0.8.2, Guild records run files in a manifest and so knows which files a run generated. These files can be listed with the -g, --generated option of guild ls.
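
The proposed default can be pictured with a small Python sketch. This is an illustrative model only, not Guild's implementation: the RunFile class, the kind labels ("sourcecode", "dependency", "generated") and the default_select function are hypothetical names, and the classification of the prepare-data files is assumed from the example in the Problem section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunFile:
    """Hypothetical stand-in for one entry in a run manifest."""
    path: str
    kind: str  # assumed labels: "sourcecode", "dependency", "generated"


def default_select(files):
    """Return the paths of run-generated files only (the proposed default)."""
    return [f.path for f in files if f.kind == "generated"]


# Assumed classification of the prepare-data run files from the example.
prepare_data_files = [
    RunFile("data-in.csv", "dependency"),
    RunFile("data-out.npy", "generated"),
    RunFile("prepare_data.py", "sourcecode"),
    RunFile("train.py", "sourcecode"),
]

selected = default_select(prepare_data_files)
print(selected)
```

Under these assumptions, a train run that requires prepare-data without an explicit select would link only data-out.npy.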

This change BREAKS existing Guild behavior and may cause existing operations to fail.

Backward compatibility

Guild will support an environment variable GUILD_LEGACY_OPDEP_SELECT that, when set to 1, will cause Guild to select both generated and dependency upstream run files by default. This is not the same behavior as pre 0.8.2 because source code files are not selected. To select all upstream run files by default, set GUILD_LEGACY_OPDEP_SELECT to 2.
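
The three selection modes described above can be sketched in Python as follows. This is a hedged illustration rather than Guild's code: the selected_kinds and select functions and the kind labels are hypothetical, and only the variable name GUILD_LEGACY_OPDEP_SELECT and its values 1 and 2 come from this proposal.

```python
import os


def selected_kinds(env=None):
    """Map GUILD_LEGACY_OPDEP_SELECT to the file kinds selected by default."""
    env = os.environ if env is None else env
    mode = env.get("GUILD_LEGACY_OPDEP_SELECT", "")
    if mode == "2":
        # All upstream run files, including source code.
        return {"generated", "dependency", "sourcecode"}
    if mode == "1":
        # Generated and dependency files, but not source code.
        return {"generated", "dependency"}
    # Proposed default: generated files only.
    return {"generated"}


def select(files, env=None):
    """Filter (path, kind) pairs down to the paths selected by default."""
    kinds = selected_kinds(env)
    return [path for path, kind in files if kind in kinds]


# Assumed classification of the prepare-data run files from the example.
files = [
    ("data-in.csv", "dependency"),
    ("data-out.npy", "generated"),
    ("prepare_data.py", "sourcecode"),
    ("train.py", "sourcecode"),
]

print(select(files, env={}))
print(select(files, env={"GUILD_LEGACY_OPDEP_SELECT": "1"}))
print(select(files, env={"GUILD_LEGACY_OPDEP_SELECT": "2"}))
```

Passing env explicitly keeps the sketch independent of the real process environment; an actual run would read the variable from the shell that launches Guild.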

Alternative Approaches

Do nothing

Guild can no longer select all files in an upstream run directory by default because, post 0.8.2, Guild copies operation source code to the run root.

While Guild could continue to include upstream run dependencies (i.e. files required by the upstream run), this is technically incorrect. If a downstream run requires files that are also required by the upstream run, it should define those explicitly.

Users who are negatively impacted by this backward-incompatible change can set GUILD_LEGACY_OPDEP_SELECT to 1 to keep selecting upstream dependency files by default, or to 2 to select all upstream run files.