I have a Guild operation `main` that runs three steps, each of which is another operation: `impute`, `evaluate`, and `predict`. The latter two depend on the `impute` operation (specifically a model checkpoint and some data output).
(guild.yml file for reference)
- When one of the steps fails (e.g. `evaluate`), the `main` op shows `error` and so does `evaluate`. If I fix the error in the code and restart the run with something like `for hash in $(guild select --operation evaluate --error --all); do guild run -y --background --restart $hash --force-sourcecode; done`, then the `evaluate` op changes to `completed`, but the `main` operation does not. It doesn’t seem possible to update it, and it is slightly unclean and annoying to keep track of what broke and what has been fixed. I end up with something like:
[71:ec03c916] evaluate 2023-02-20 14:43:57 completed dvae myexperiment
[72:957ecb30] evaluate 2023-02-20 14:43:56 completed dvae myexperiment
[73:19493e6b] evaluate 2023-02-20 14:43:56 completed dvae myexperiment
...
[127:fe72a7ff] predict 2023-02-18 20:58:56 completed dvae
[128:617bc8fd] impute 2023-02-18 20:26:16 completed dvae myexperiment
[129:2b155ff0] main 2023-02-18 20:26:14 error dvae
[130:39125144] predict 2023-02-18 20:21:08 completed dvae
[131:5c4ed46a] impute 2023-02-18 19:45:25 completed dvae myexperiment
[132:c542fcbe] main 2023-02-18 19:45:24 error dvae
It says `error` for `main`, but it has really been fixed since the `evaluate` op was fixed.
Another issue is which files are stored under each op, which leads me to the next point, where I’ll use run 132 as an example:
- If I look at what is stored under the `main` op, I see:
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ ls
evaluate impute options.yml predict
If I drill into the directories I see:
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ cd evaluate
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a/evaluate$ ls
F.O. options.yml serialized_models
If `evaluate` fails and I rerun it, does that mean that the `evaluate` folder will be updated too (is it a symlink)? There seems to be some redundancy too, which leads me to:
- If I look at the output of the substeps `impute` and `predict`, I see:
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345/serialized_models$ ls
AEDitto_STATIC.pt imputed_data.pkl STATIC_test_dataloader.pt
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a/serialized_models$ ls
AEDitto_STATIC.pt imputed_data.pkl STATIC_test_dataloader.pt
I also see
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml
It looks like it copies everything from the `impute` op to the parent, `main`, and to the dependent steps, `predict` and `evaluate`. This is a lot of redundancy, especially for expensive/large models and artifacts, and it is making me run out of space on my machine.
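To tell whether these are real copies or just links to the same data, a small shell helper could be used. This is only a sketch and not something Guild provides: the function name `compare_artifacts` is made up for illustration, and it assumes a shell whose `test` supports `-ef` (bash and dash do).

```shell
# Classify how two artifact paths relate on disk.
# Prints "same-file" (a symlink or hard link to the same data),
# "identical-copy" (separate files with equal bytes), or "different".
compare_artifacts() {
    local a=$1 b=$2
    if [ "$a" -ef "$b" ]; then
        # -ef follows symlinks and compares device and inode numbers.
        echo "same-file"
    elif cmp -s "$a" "$b"; then
        # cmp -s exits 0 only when both files have identical contents.
        echo "identical-copy"
    else
        echo "different"
    fi
}

# Example call using the impute and predict run directories shown above:
# runs=~/mambaforge/envs/ap/.guild/runs
# compare_artifacts \
#     "$runs/5c4ed46a12e145158b2351621ee81345/serialized_models/AEDitto_STATIC.pt" \
#     "$runs/391251446fe041048d93d80deda6ac8a/serialized_models/AEDitto_STATIC.pt"
```

A result of `identical-copy` would mean the checkpoint really is taking up space twice, while `same-file` would mean the listing only looks redundant.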
My questions are:
a) How do I avoid redundancy in stored artifacts between parent and child steps, as with `main` and its substeps?
b) How do I avoid redundancy among sibling runs where one depends on another? While `evaluate` relies on the artifacts from `impute`, I don’t want it to store all the artifacts all over again (including the model checkpoints, data, and logging files); I just want `evaluate` to use the checkpointed data and model. I know there’s a `select:` option, but it seems to be regex-based, making it complicated to select the checkpointed model AND data. Also, even if that solves excluding the logged files, I don’t want the files it relies on to be copied into the final logged artifacts.
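For the regex part, a single alternation might cover both files. The sketch below only previews such a pattern with `grep -E` against the relative paths from the impute run listed above. It assumes that the checkpointed model is `AEDitto_STATIC.pt` and the data is `imputed_data.pkl`, and that `select:` matches patterns against run-relative paths. How Guild actually applies the pattern (regex dialect, anchoring) is not verified here, and the helper name `preview_select` is hypothetical.

```shell
# Candidate pattern: one alternation naming both files under serialized_models/.
pattern='serialized_models/(AEDitto_STATIC\.pt|imputed_data\.pkl)$'

# Print the lines from stdin that the given extended regex matches.
preview_select() {
    grep -E "$1"
}

# Relative paths taken from the impute run directory.
impute_paths() {
    printf '%s\n' \
        'serialized_models/AEDitto_STATIC.pt' \
        'serialized_models/imputed_data.pkl' \
        'serialized_models/STATIC_test_dataloader.pt' \
        'F.O./0.33/MNAR(G)/dvae/lightning_logs/version_0/hparams.yaml'
}

impute_paths | preview_select "$pattern"
```

With these inputs only the first two paths are printed, so the dataloader and the Lightning log files would be left out if `select:` behaves the same way.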