Using guild for data parallelization

copah · April 18, 2022, 2:07pm

With the upcoming integration with Slurm, I am interested in a good way of using Guild + Slurm to parallelize heavy computation across data.

Say, for example, I have a preprocessing step wrapped in Guild that takes a large video file as input and outputs a processed video, and that I have three such videos. Currently I would do something like this in Guild:

guild run model:preprocess video_file_path='[path1, path2, path3]'

Optionally, I can use the Dask scheduler to run these in parallel on one machine. I imagine I would be able to do the same using Slurm, which would be awesome.

My one issue with this is that these aren't really three logically separate experiments, but essentially one experiment.

I was wondering if there was a way to merge such a batch run into a single experiment? Maybe something like:

guild merge RUN_1 RUN_2 RUN_3

Or something similar. Maybe there is someone out there with a better suggestion on how to handle this.

Looking forward to the discussion.

copah · April 21, 2022, 6:07pm

So I realize that guild actually has some of this functionality.

If I pass --keep-batch when doing a batch run, the batch run persists, and it has the merged folder structure I am looking for.

This actually solves my issue.

copah · April 28, 2022, 3:03pm

@garrett

Is there any way, when specifying a pipeline, to have one operation resolve its dependency on a batch run instead of on an individual run that the batch run has produced?

Essentially I would like to do something like this:

guild run model:preprocess video_file_path='[path1, path2, path3]'
guild run model:train model_preprocess_op=<BATCH_RUN_ID>

The above works (only if you use the full run ID, by the way), but I would like to specify that in a Guild pipeline.

garrett · April 29, 2022, 2:35pm

Sorry for the late reply here! Yes I think so, let me work up an example to confirm that it works. Stay tuned.

garrett · April 29, 2022, 3:15pm

There’s a bug in Guild that prevents what I think should work from working. I created a GitHub issue to track this.

Here are the details on what I’m seeing:

There’s a workaround in there that might help in your case.

You should be able to specify a batch requirement using <op name>+ in the dependency, but this isn’t working. As a workaround you can use the select command with command substitution.

guild run summary batch=`guild select --operation op+`
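
The one-liner above starts the downstream run even when the lookup returns nothing. A slightly more defensive variant is sketched below. It assumes the placeholder operation names `op` and `summary` from the command above; `run_summary_on_batch` is a hypothetical helper name, not part of Guild. It resolves the batch run ID first and only starts the downstream run if an ID was found.

```shell
# Sketch only: `op` and `summary` are the placeholder operation names used in
# the command above; `run_summary_on_batch` is a hypothetical helper.
run_summary_on_batch() {
  # Operation spec for the batch run; the trailing "+" follows the post above.
  batch_op="${1:-op+}"

  # Resolve the batch run ID up front so a failed lookup is caught explicitly.
  batch_run_id="$(guild select --operation "$batch_op")" || return 1

  if [ -z "$batch_run_id" ]; then
    echo "No run found for operation: $batch_op" >&2
    return 1
  fi

  # Pass the resolved run ID to the downstream operation.
  guild run summary batch="$batch_run_id"
}
```

Calling `run_summary_on_batch` with no arguments mirrors the one-liner; passing a different operation spec as the first argument selects a different batch.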

copah · May 3, 2022, 12:18pm

@garrett

Sorry for the late reply. Just completed a coast to coast move! Will respond in the GitHub issue.

copah · May 10, 2022, 11:13am

@garrett

In relation to this, are there any plans for a “join/merge” operation? I am considering something like this:

guild join <LIST_OF_RUN_IDS>
guild join --tag <MY_TAG>
guild join --label <MY_LABEL>

This would create a new experiment with links to, or copies of, the joined runs.

It would make it easy to, for example, add a new experiment to a set of experiments used for some kind of summary operation or, in the case of this topic, to process more data after the first batch run and then add the results to the same batch run.

garrett · May 10, 2022, 2:55pm

Yes, this is what I’ve been casually referring to as “summary ops” for quite a while now (with little progress, alas!)

I wrote up my thinking here: Summary operations.

Please feel free to comment there — your input on the feature is most welcome!
