Optional operation/run dependencies

garrett · October 4, 2022, 6:43pm

Summary

This proposal seeks to address Guild’s requirement that an upstream run be available for an operation dependency. It proposes new operation attributes that control how Guild responds when it can’t find a suitable run for an operation dependency. fail-if-unresolved may be set to false to prevent run failure and warn-if-unresolved can be set to silence warnings when a dependency can’t be resolved.

This proposal is awaiting feedback

Problem

A run fails when Guild cannot resolve its dependencies. In most situations, this is desirable — the run should not proceed if a required resource is not available.

However, there are cases where a user wants resources linked when they exist and wants the run to proceed without them when they don’t. Such runs are designed to work whether or not the resources are available.

Consider a training run that makes use of a previous training run (e.g. to learn from the previous result or to resume training with saved models). In this case, the operation might be defined like this:

train:
  requires:
    - operation: train
      select:
        - saved_model.*

If a previous training run doesn’t exist, the operation will fail. This prevents users from using this sort of self-referencing operation.

Proposed Approach

Guild will support two new attributes for a resource source: fail-if-unresolved and warn-if-unresolved. If fail-if-unresolved is true (default), Guild generates an error when the source cannot be resolved. If false, Guild proceeds with the run, logging a warning first unless warn-if-unresolved says otherwise. If warn-if-unresolved is true (default), Guild logs a warning when the resource source cannot be resolved. If false, Guild proceeds with the run without a warning.
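The interaction between the two attributes can be summarized in a short sketch. This is not Guild's implementation: `handle_unresolved`, `UnresolvedDependency`, and the return values are hypothetical names used only to illustrate the behavior described above.

```python
import logging

log = logging.getLogger("guild-sketch")


class UnresolvedDependency(Exception):
    """Raised when a source cannot be resolved (hypothetical name)."""


def handle_unresolved(source_name, fail_if_unresolved=True, warn_if_unresolved=True):
    """Decide what happens when no suitable run is found for a source.

    Returns "warned" or "silent" when the run may proceed and raises
    UnresolvedDependency otherwise. The defaults mirror the proposal.
    """
    if fail_if_unresolved:
        # Default behavior: the run fails.
        raise UnresolvedDependency(f"cannot resolve {source_name}")
    if warn_if_unresolved:
        # The run proceeds, but the user is told about the missing source.
        log.warning("cannot resolve %s, continuing", source_name)
        return "warned"
    # The run proceeds without any warning.
    return "silent"
```

Note that warn-if-unresolved only has an effect when fail-if-unresolved is false; with the defaults, the run fails before any warning is considered.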

To implement a training run that continues when an upstream required operation cannot be found (see above example) an operation can be configured as:

train:
  requires:
    - operation: train
      fail-if-unresolved: false
      select:
        - saved_model.*

In this case, Guild attempts to resolve the dependency by finding a non-error run for train. If it cannot find such a run, it logs a message indicating that it can’t resolve the dependency but continues nonetheless.

To suppress the warning message, the user could specify warn-if-unresolved: false for the resource source.

Alternative Approaches

Do nothing

One Guild user proposed a work-around for this limitation:

While this is an ingenious workaround, it’s unintuitive and complicated compared to the optional attribute. Doing nothing here is not an option.

Use a single `optional` flag

The attribute fail-if-unresolved is arguably a bit pedantic/wonky. A simpler optional flag would be sufficient to address the target problem.

The example above would look like this:

train:
  requires:
    - operation: train
      optional: true
      select:
        - saved_model.*

Drawbacks:

fail-if-xxx is already a convention for options that specify whether or not Guild should continue with the run (e.g. fail-if-empty).
optional does not control whether or not Guild logs a warning when the source can’t be resolved.
optional is paradoxical for a requirement.

garrett · October 5, 2022, 5:55pm

The attribute fail-if-unresolved feels too wonky. I’m inclined to opt for the single optional attribute proposed as an alternative.

If a resource source is marked as optional it ought not trigger a warning message when it can’t be resolved — the log level should be INFO. There’s no need to control warning levels.
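A minimal sketch of that simpler behavior, assuming a hypothetical `find_run` callable that returns a run ID or `None`. None of these names come from Guild itself; the point is only that an unresolved optional source is logged at INFO and skipped, while a required one still fails.

```python
import logging

log = logging.getLogger("guild-sketch")


def resolve_source(find_run, optional=False):
    """Return the run found by `find_run`, or None for an unresolved optional source.

    `find_run` is a hypothetical callable returning a run ID or None.
    """
    run = find_run()
    if run is not None:
        return run
    if optional:
        # Not a warning: skipping an optional source is the expected outcome.
        log.info("optional dependency not resolved, skipping")
        return None
    # A required source that cannot be resolved still fails the run.
    raise RuntimeError("cannot resolve required dependency")
```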

garrett · October 5, 2022, 6:12pm

Note that guild run has a --force-deps option that can be used to continue when a dependency cannot be resolved. There’s a case for the “do nothing” option above, as this could serve as a work-around. The optional setting, however, is a more direct solution to the user’s problem.

vitalwarley · October 11, 2022, 7:33pm

I have the following use case:

...
      requires:
        - operation: prepare-data
          select: dataset
        - file: src/python/cvt
        - file: models
        - operation: train
          select: '.+best\.pt'

where one of the op flags is named weights. If I pass weights=best.pt and there is a train op, then its weights file is selected. However, I can also use a versioned model in this op with weights=models/task/weights.pt.

Today I use --force-deps if I am in an environment without a train op, but I think it would be better to explicitly solve that with an optional flag, as you mentioned. I don’t really care about any warning because in my case I know from the start if I want to use the weights from a specific run or not. Nonetheless, maybe an attribute to optional, if a user wants the warning to happen, could be done like

...
      requires:
        - operation: prepare-data
          select: dataset
        - file: src/python/cvt
        - file: models
        - operation: train
          optional: true
          warning-if-unresolved: true
          select: '.+best\.pt'

garrett · October 17, 2022, 4:54pm

Terrific - thank you for the feedback! The implementation (under way, yay!) is to spell this as optional and not warn. This is not really a case for warning as the point of the feature is to explicitly exempt the requirement when a run can’t be resolved.

teracamo · October 20, 2022, 7:47am

I noticed a change of behavior. I have a validation operation that requires the operation train.
Occasionally, I ask the validation to run on a train run that was not its original dependency, using a command like:

guild run --restart validation train=[another train ops]

This used to run without problems in 0.8.1, but with 0.8.2 I now get the error message:

guild: cannot specify a value for 'train' when restarting [SHA]- resource has already been resolved

I wonder if this is intended?

Edit:

After some tweaking I discovered I can override this by deleting the file .guild/attrs/deps, but I am not sure if this will break anything.
