How to have optional run resources

I have multiple datasets that I preprocess using different guild operations. My train operation in turn uses these runs as resources, and I want to be able to only input a selected subset of these preprocess runs to my train operation to be able to experiment with how different data subsets affect the training. How do you suggest I do this? My preferred way would be to have all resources default to not being used and only be included if I specifically set their values, but this does not seem to be possible?

Generally you resolve one upstream data prep run for a downstream training run. Guild doesn’t even support resolving multiple runs per resource, unless you explicitly configure it with multiple dependencies.

The problem is then how you can select the applicable upstream run to experiment with.

There are a couple of methods. The first is simply to specify the run ID for the applicable dependency name.

prepare-data: {}
train:
  requires:
    - operation: prepare-data

By default, Guild uses the latest non-error run for the specified operation. You can override this when you run train by specifying a run ID:

guild run train prepare-data=<run ID>

If you don’t want to copy-paste run IDs, you can use guild select with command substitution to provide a run ID using Guild filters. For example, if you have a prepare-data run that’s been tagged with augmentation-2, you can use this:

guild run train prepare-data=`guild select -Ft augmentation-2`

If you have multiple prepare data operations you can use a pattern when specifying the dependency.

prepare-data-1:
  main: prepare_data --v1

prepare-data-2:
  main: prepare_data --v2

train:
  requires:
    - operation: prepare-data-.+
      name: data

Guild will select the latest run for an operation matching prepare-data-.+ but you can override this as well using the resource name data like this:

guild run train data=<run ID>
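
To make the selection rule concrete, here is a minimal Python sketch of the behavior described above: use the explicitly given run ID if there is one, otherwise take the most recent non-error run whose operation name matches the pattern. This is not Guild's implementation. The Run record, the status strings, the integer started field, and the use of a full regex match are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    # Hypothetical stand-in for a run record; not Guild's own data model.
    id: str
    operation: str
    status: str
    started: int  # larger means more recent


def resolve_run(runs: List[Run], pattern: str, run_id: Optional[str] = None) -> Optional[Run]:
    """Pick the run a dependency would use under the rule sketched above."""
    if run_id is not None:
        # An explicit run ID (e.g. data=<run ID>) overrides the default choice.
        return next((r for r in runs if r.id == run_id), None)
    candidates = [
        r for r in runs
        if re.fullmatch(pattern, r.operation) and r.status != "error"
    ]
    # Latest matching run, or None when nothing matches.
    return max(candidates, key=lambda r: r.started, default=None)


runs = [
    Run("aaa111", "prepare-data-1", "completed", started=1),
    Run("bbb222", "prepare-data-2", "completed", started=2),
    Run("ccc333", "prepare-data-2", "error", started=3),
]
default_run = resolve_run(runs, r"prepare-data-.+")
pinned_run = resolve_run(runs, r"prepare-data-.+", run_id="aaa111")
```

Here default_run is the newest run that did not fail, and pinned_run is whatever run was named explicitly, regardless of age.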

The second method gets even trickier, using flag substitutions.

prepare-data-v1:
  main: prepare_data --v1

prepare-data-v2:
  main: prepare_data --v2

train:
  requires:
    - operation: prepare-data-${dataset-type}
      name: data
  flags:
    dataset-type:
      choices: [v1, v2]
      default: v1
      arg-skip: yes

This bit of madness lets you control the operation name being resolved using a flag.

There’s a lot here, so please follow up with any questions. Also, I’m not 100% sure I understood your question completely, but hopefully there’s something on point here :slight_smile:

Your second method almost solves the problem for me, but I would still like to be able to have the resources be optional.

Here’s an example of what I want to do, just to make it clearer.

prepare-data-1:
  main: prepare_data_1  # these aren't versions of the data, but actually completely different data

prepare-data-2:
  main: prepare_data_2

train:
  requires:
    - operation: prepare-data-1
      name: data_1
      optional: yes  # would like this
    - operation: prepare-data-2
      name: data_2
      optional: yes  # would like this

Now, I know this isn’t currently possible, but this is how I would prefer to solve my problem. I would like the optional flag to make the default behaviour be to not include the run at all, unless the run is specified. Do you think it makes sense to implement this functionality?

Another solution I’ve considered is something like this

prepare-data-1:
  main: prepare_data_1

prepare-data-2:
  main: prepare_data_2

empty:
  main: empty  # does nothing

train:
  requires:
    - operation: prepare-data-1|empty
      name: data_1
    - operation: prepare-data-2|empty
      name: data_2

This would solve my problem but is a bit of a hack and quite ugly.
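
On the train side, this workaround means the script has to work out which inputs were really provided. A minimal sketch, assuming each data prep run contributes one known file to the train run directory and that a dependency resolved to empty contributes none; the file names data_1.csv and data_2.csv and the helper available_inputs are hypothetical:

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping from resource name to the file that resource provides.
EXPECTED: Dict[str, str] = {
    "data_1": "data_1.csv",
    "data_2": "data_2.csv",
}


def available_inputs(run_dir: Path, expected: Dict[str, str]) -> List[str]:
    """Return the resource names whose files are present in run_dir."""
    return [
        name
        for name, filename in expected.items()
        if (run_dir / filename).is_file()
    ]


# Simulate a train run directory where only data_1 was resolved to a real run.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "data_1.csv").write_text("x,y\n1,2\n")
    found = available_inputs(run_dir, EXPECTED)
```

The training code can then load only the datasets listed in found and skip the rest.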

In case you don’t want to implement my above suggestion into guild, do you have another suggestion for how to solve this problem?

Ah, I understand better now. Thank you for the clarification.

You’ve identified a problem that Guild does not explicitly handle. I think designating a dependency as optional is a possible enhancement we could make. There are some other options we can look at as well.

I believe this requirement relates tangentially to an upcoming “summary operation” feature. The idea of a “summary operation” is that it performs some task on multiple runs. The common scenario is when you want to analyze runs to perform higher-level summaries (e.g. find an optimal training run using a custom selection algorithm, generate reports, etc.). While your case is not strictly a summary op, it does follow the same pattern: a run needs access to one or more runs that match selection criteria. In your case, those are data prep runs.

I’m improvising here, but what if we added an attribute that designates an operation dependency as applying to “multiple runs”?

# guild.yml

prepare-data-1: {}
prepare-data-2: {}

train:
  requires:
    - operation: prepare-data-.+
      multi-run: yes
      target-path: data

By default this would resolve the most recent non-error runs matching the operation selection criteria — in this case names starting with prepare-data-. Selected files from each run would be linked under a data subdirectory. To formally handle name collisions, a target path could support references to run attributes like ID or resource index. For example, target-path: data-${id} could be used to install resolved data files under unique subdirectories containing the applicable run ID.
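
As a rough illustration of the collision point, the sketch below expands a target path such as data-${id} per run and reports destinations that more than one run would write to. It only models the idea as described here; the functions, run IDs, and file names are hypothetical, and none of this reflects an existing Guild API.

```python
import re
from collections import defaultdict
from typing import Dict, List


def render_target_path(template: str, run_attrs: Dict[str, str]) -> str:
    """Replace ${name} references in a target path with run attribute values."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: run_attrs[m.group(1)], template)


def plan_links(template: str, run_files: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each destination path to the run IDs that would write to it."""
    plan = defaultdict(list)
    for run_id, files in run_files.items():
        target = render_target_path(template, {"id": run_id})
        for name in files:
            plan[f"{target}/{name}"].append(run_id)
    return dict(plan)


def collisions(plan: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the destinations claimed by more than one run."""
    return {dest: ids for dest, ids in plan.items() if len(ids) > 1}


# Hypothetical runs: both produce a file named train.csv.
run_files = {"aaa111": ["train.csv"], "bbb222": ["train.csv", "labels.csv"]}
shared = collisions(plan_links("data", run_files))
unique = collisions(plan_links("data-${id}", run_files))
```

With the shared path data, both runs claim data/train.csv; with data-${id}, each run gets its own subdirectory and nothing collides.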

In the meantime, the work-around that you’ve presented should work. I definitely agree we don’t want to rely on that long term.

A couple changes I think we should also consider:

  • --force-deps option to run to continue even when a required resource can’t be resolved.

  • As an alternative to optional (an optional requirement feels a bit dissonant to me :slight_smile: ) we could provide a fail-if-unmet attribute, which defaults to yes but could be set to no to accomplish what you ultimately asked for (even spelled out in the title of the topic!). Guild currently supports fail-if-empty, which tests whether any files are actually selected, so there’s some precedent for this approach.


Thanks for the reply, the multi-run attribute would be a good solution to my problem! If I understand the idea correctly, it would support supplying a list of runs as the required resource, something like guild run train prepare-data=<id1>,<id2>,<id3>?

In my case, the runs would never have name collisions, so I would actually prefer if the files were just dumped in the same target path. It would be nice if this were the default behavior, with the option to specify something like target-path: data-${id}, as you suggested, when the user knows there will be name collisions.

I also agree that an optional requirement doesn’t make a whole lot of sense :sweat_smile: Your suggested alternatives sound good.

Yes, you have all of that right. The default target path is . (the root run directory) and that would apply to the new multi-run feature. Guild already warns when sources collide so you’d know when that happens. The use of run-specific references (e.g. ${id}) is optional for those who would need it.


This has been lingering! We have a request for optional for a different use case:

I think this needs to land, in addition to the multi-run feature.

Will there be an RFC on the optional and multi-run features?

Late, late, late, but here’s the RFC. The feature is committed to main and will be available in 0.8.2 (out in the next week or so).