I have a guild operation `main` that runs 3 steps which are other operations: `impute`, `predict`, and `evaluate`. The latter two depend on the `impute` operation (specifically a model checkpoint and some data output). (`guild.yml` file for reference.)
- When one of the steps fails (e.g. `evaluate`), the `main` op shows an error and so does `evaluate`. If I fix the error in the code and restart the run with something like

```shell
for hash in $(guild select --operation evaluate --error --all); do
  guild run -y --background --restart $hash --force-sourcecode
done
```

then the `evaluate` op flips to completed, but the `main` operation does not. It doesn't seem possible to update it, which is slightly unclean and makes it annoying to keep track of what broke and what is fixed. I end up with something like:
```
[71:ec03c916]   evaluate  2023-02-20 14:43:57  completed  dvae myexperiment
[72:957ecb30]   evaluate  2023-02-20 14:43:56  completed  dvae myexperiment
[73:19493e6b]   evaluate  2023-02-20 14:43:56  completed  dvae myexperiment
...
[127:fe72a7ff]  predict   2023-02-18 20:58:56  completed  dvae
[128:617bc8fd]  impute    2023-02-18 20:26:16  completed  dvae myexperiment
[129:2b155ff0]  main      2023-02-18 20:26:14  error      dvae
[130:39125144]  predict   2023-02-18 20:21:08  completed  dvae
[131:5c4ed46a]  impute    2023-02-18 19:45:25  completed  dvae myexperiment
[132:c542fcbe]  main      2023-02-18 19:45:24  error      dvae
```
Runs 129 and 132 still show an error for `main`, but they have really been fixed since the `evaluate` op was fixed.
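A workaround I'm considering (just a sketch reusing the flags from my `evaluate` loop above; I haven't verified that restarting a `main` run picks up the already-fixed steps rather than re-executing everything):

```shell
# Hypothetical: restart every errored `main` run so its status gets
# recomputed, using the same select/restart pattern as for `evaluate`.
for hash in $(guild select --operation main --error --all); do
  guild run -y --restart "$hash"
done
```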
Another issue is what files are stored under each op, which leads me to the next point, where I'll use run `132` as an example:
- If I look at what is stored under the `main` op I see:

```shell
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ ls
evaluate  impute  options.yml  predict
```
If I drill into the directories I see:

```shell
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ cd evaluate
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a/evaluate$ ls
F.O.  options.yml  serialized_models
```
If `evaluate` fails and I rerun it, does that mean that this `evaluate` folder will be updated too (is it a symlink)? There seems to be some redundancy too, which leads me to:
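One generic way to check the symlink question (a throwaway-directory sketch, not Guild-specific): a symlink, when followed, reports its target's inode, while a real copy gets its own.

```shell
# Compare inodes: a followed symlink (-L) matches its target, a copy doesn't.
demo=$(mktemp -d)
echo data > "$demo/model.pt"
ln -s model.pt "$demo/link.pt"
cp "$demo/model.pt" "$demo/copy.pt"
stat -Lc '%i' "$demo/model.pt" "$demo/link.pt" "$demo/copy.pt"
```

Running `stat -Lc '%i'` on a file under `main`'s `evaluate` folder and on the same file in the standalone `evaluate` run would show whether Guild links or copies it.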
- If I look at the output of the substeps:

```shell
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345/serialized_models$ ls
AEDitto_STATIC.pt  imputed_data.pkl  STATIC_test_dataloader.pt

# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a/serialized_models$ ls
AEDitto_STATIC.pt  imputed_data.pkl  STATIC_test_dataloader.pt
```
I also see:

```shell
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml

# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml
```
It looks like everything from the `impute` op is copied both to the parent `main` run and to the dependent steps (`predict` and `evaluate`). This is a lot of redundancy, especially for expensive/large models and artifacts, and it is making me run out of space on my machine.
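To quantify the duplication (a generic sketch; the glob just assumes the run layout shown above):

```shell
# Total up each run's serialized_models directory, smallest first.
du -sh ~/mambaforge/envs/ap/.guild/runs/*/serialized_models 2>/dev/null | sort -h
```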
My questions are:

a) How do I avoid redundancy in stored artifacts between a parent and its child steps, like `main` and its substeps?
b) How do I avoid redundancy among sibling runs where one depends on another? While `evaluate` relies on the artifacts from `impute`, I don't want it to store all those artifacts all over again (including the model checkpoints, data, and logging files); I just want `evaluate` to use the checkpointed data and model. I know there's a `select:` option, but it seems to be regex-based, making it complicated to select the checkpointed model AND the data. Also, even if that solves excluding the logged files, I don't want to copy the files it relies on into the final logged artifacts.
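For concreteness, this is the kind of thing I imagine for b) (a sketch only; I'm assuming `select:` accepts multiple regex patterns, and the filenames come from the `serialized_models` listing above — I haven't confirmed this works):

```yaml
# Hypothetical guild.yml fragment: evaluate depends on impute but only
# pulls in the model checkpoint and the imputed data, not the logs.
evaluate:
  requires:
    - operation: impute
      select:
        - serialized_models/AEDitto_STATIC\.pt
        - serialized_models/imputed_data\.pkl
```

Even if that handles the selection, my second concern remains: I don't want the selected files re-saved as part of `evaluate`'s own artifacts.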