Confusion about multi-step operations, restarting substeps, and copied files?

I have a guild operation main that runs three steps, which are themselves other operations: impute, evaluate, and predict. The latter two depend on the impute operation (specifically a model checkpoint and some data output).
(guild.yml file for reference)
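The full file is larger, but roughly the structure is as follows (a simplified sketch, with flags and other details omitted):

```yaml
# Simplified sketch of the guild.yml structure (flags/details omitted)
impute: {}

evaluate:
  requires:
    - operation: impute   # needs the model checkpoint and data output

predict:
  requires:
    - operation: impute

main:
  steps:
    - impute
    - evaluate
    - predict
```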

  1. When one of the steps fails (e.g. evaluate), the main op shows error, and so does evaluate. If I fix the error in the code and restart the runs with something like for hash in $(guild select --operation evaluate --error --all); do guild run -y --background --restart $hash --force-sourcecode; done, then the evaluate ops are updated to completed, but the main operation is not. There doesn’t seem to be a way to update it, which is a bit unclean and makes it annoying to keep track of what broke and what has been fixed. I end up with something like:
[71:ec03c916]   evaluate  2023-02-20 14:43:57  completed  dvae myexperiment
[72:957ecb30]   evaluate  2023-02-20 14:43:56  completed  dvae myexperiment
[73:19493e6b]   evaluate  2023-02-20 14:43:56  completed  dvae myexperiment
[127:fe72a7ff]  predict   2023-02-18 20:58:56  completed  dvae 
[128:617bc8fd]  impute    2023-02-18 20:26:16  completed  dvae myexperiment
[129:2b155ff0]  main      2023-02-18 20:26:14  error      dvae 
[130:39125144]  predict   2023-02-18 20:21:08  completed  dvae 
[131:5c4ed46a]  impute    2023-02-18 19:45:25  completed  dvae myexperiment
[132:c542fcbe]  main      2023-02-18 19:45:24  error      dvae 

It says error for main, but it’s really been fixed since the evaluate op was fixed.
Another issue is what files are stored under each op, which leads me to my next point, where I’ll use run 132 as an example:

  2. If I look at what is stored under the main op I see:
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ ls
evaluate  impute  options.yml  predict

If I drill into the directories I see:

me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ cd evaluate
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a/evaluate$ ls
F.O.  options.yml  serialized_models

If evaluate fails and I rerun it, does that mean the evaluate folder will be updated too (is it a symlink)? There seems to be some redundancy as well, which leads me to:

  3. If I look at the output of the substeps impute and predict I see:
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345/serialized_models$ ls
imputed_data.pkl
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a/serialized_models$ ls
imputed_data.pkl

I also see

# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/

It looks like it copies over everything from the impute op to the parent (main) and the dependent steps (predict and evaluate). This is a lot of redundancy, especially for expensive/large models and artifacts, and it is making me run out of space on my machine.

My questions are:
a) How do I avoid redundancy in stored artifacts between a parent and its child steps, like main and its substeps?
b) How do I avoid redundancy amongst sibling runs where one depends on another? While evaluate relies on the artifacts from impute, I don’t want it to store all those artifacts all over again (including the model checkpoints, data, and logging files); I just want evaluate to use the checkpointed data and model. I know there’s a select: option, but it seems to be regex-based, making it complicated to select the checkpointed model AND the data. Also, even if that solves excluding the logged files, I don’t want to copy the files it relies on into the final logged artifacts.
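For reference, this is the kind of select: config I mean; a rough sketch, assuming select accepts a list of regex patterns (the file names here are made up):

```yaml
# Rough sketch (hypothetical file names); assumes select takes regex patterns
evaluate:
  requires:
    - operation: impute
      select:
        - serialized_models/model.*\.ckpt      # the model checkpoint
        - serialized_models/imputed_data\.pkl  # the data output
```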

Following up on this if anyone has any ideas!

Thanks @davzaman for the questions/report! I’d like to give you a minimal example that shows what I think is going on. Guild should definitely not be copying run files over when it resolves dependencies. If it is, something surprising is going on there.

The directories under the stepped/parent run are indeed symlinks to the child run directories. They do not contain copies of the child runs.

As for the stepped operation not showing completed after you fix the child runs, that is annoying, but it does technically reflect the status of the operation, which is based on the exit code of the stepped (parent) run. The status is not inferred from the statuses of the child runs.

I’d like to be able to say “restart the parent, re-running any failed runs”, which would give the parent the chance to update its exit code/status. Guild should support that, but I’d like to demonstrate it with a minimal case that we can run and demo if need be.

Give us a day or so to look into this.


Sounds good!

Gotcha so they are symlinks, that’s great to know!

Yeah, that’s what I figured; simply speaking, to correct the main operation, the main operation should be rerun, as it’s technically a gestalt of its steps (I’m assuming) rather than just a minimal wrapper.

That operation would be really useful!

Thanks, let me know if I can help.

@davzaman It looks like there was a regression and Guild is indeed copying resolved operation dependency files. This is not the intended behavior and we’ll fix that ASAP.

As a workaround, avoid copying files by adding target-type to your dependency def like this:

upstream: {}

downstream:
  requires:
    - operation: upstream
      target-type: link  # tells Guild to link to the resolved files, not copy

The downstream operation is any operation that requires an upstream run.

Sorry about that! This will make a big difference in disk space for you. We’ll post here when the fix is applied, after which you can remove the explicit target-type in your dependencies.


@davzaman linking (rather than copying) is now the default behavior for an operation’s resolved resources, no target-type required.

Guild also has the --needed flag for when you want to restart your pipeline, which re-runs only the failed operations that are needed for the pipeline to pass.


Thank you for the update!

Question about the --needed flag: say I manually reran the failing substeps of the pipeline. If I restart the parent operation with the --needed flag, will it update the status of the parent op without rerunning anything? I found this document about restarting an op with steps in it. It seems like my guess is correct but I figured I’d ask!

yep, that’s right. if you’re going through and individually restarting/fixing steps from the pipeline, you can then rerun the pipeline with --needed to update its status.

also, if you’re confident that you’ve fixed all the issues with your pipeline, you can just supply the --needed flag directly to a restart of the pipeline, which will automatically go through and efficiently fix/check each step, and only restart when necessary.

you found the relevant core tests that cover this topic, and there’s also this issue resolution doc with mostly the same information.

feel free to ask if you have any more questions :slight_smile: