Using Guild for data parallelization

With the upcoming Slurm integration, I am interested in a good way of using Guild + Slurm to parallelize heavy computation across data.

Say, for example, I have a preprocessing step wrapped in Guild that takes a large video file as input and spits out some processed video. For this example, say I have three such videos. Currently I would do something like this in Guild:

guild run model:preprocess video_file_path='[path1, path2, path3]'

Optionally, I can use the Dask scheduler to run the trials in parallel on one machine. I imagine I would be able to do the same using Slurm, and that would be awesome.
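For reference, the single-machine Dask setup I have in mind looks roughly like this. I'm writing the `--stage-trials` option and the built-in `dask:scheduler` operation with its `workers` flag from memory, so treat the exact names as assumptions and check `guild run dask:scheduler --help-op`:

```
# Stage one trial per video path instead of running them right away
guild run model:preprocess video_file_path='[path1, path2, path3]' --stage-trials

# Start a Dask scheduler that picks up the staged runs and executes them in parallel
# (operation and flag names assumed, see the note above)
guild run dask:scheduler workers=3
```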

My one issue with this is that these aren't really three logically separate experiments, but essentially just one experiment.

I was wondering if there was a way to merge such a batch run into a single experiment? Maybe something like:

guild merge RUN_1 RUN_2 RUN_3

Or something similar. Maybe someone out there has a better suggestion for how to handle this.

Looking forward to the discussion.

So I realize that Guild actually has some of this functionality.

If I pass --keep-batch when doing a batch run, the batch run persists and it actually has the merged folder structure that I am looking for.
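For reference, the full command from my example above then becomes:

guild run model:preprocess video_file_path='[path1, path2, path3]' --keep-batch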

This actually solves my issue :slight_smile:

@garrett

Is there any way, when specifying a pipeline, to have one operation resolve its dependency on a batch run instead of on an individual run that the batch has produced?

Essentially I would like to do something like this:

  1. `guild run model:preprocess video_file_path='[path1, path2, path3]'`
  2. `guild run model:train model_preprocess_op=<BATCH_RUN_ID>`

The above works (only if you use the full run ID, by the way), but I would like to specify this in a Guild pipeline.
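For concreteness, here is a rough sketch of the kind of Guildfile I have in mind. The script names (`preprocess`, `train`) are just placeholders, and I am not sure about the exact dependency/step syntax, so treat it as an illustration rather than a working config:

```yaml
- model: model
  operations:
    preprocess:
      main: preprocess            # placeholder for a preprocess.py script
      flags:
        video_file_path: null
    train:
      main: train                 # placeholder for a train.py script
      requires:
        - operation: preprocess   # how would I make this resolve to the batch run?
    pipeline:
      steps:
        - run: preprocess
          flags:
            video_file_path: '[path1, path2, path3]'
        - run: train
```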

Sorry for the late reply here! Yes I think so, let me work up an example to confirm that it works. Stay tuned.

There's a bug in Guild that prevents what should work from working. I created a GitHub issue to track this.

The details on what I'm seeing are in the issue, along with a workaround that might help in your case.

You should be able to specify a batch requirement using `<op name>+` in the dependency, but this isn't working. As a workaround, you can use the `select` command with command substitution:

guild run summary batch=`guild select --operation op+`
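Applied to your example, that would look something like this (using the `model_preprocess_op` dependency name from your earlier post; adjust it to whatever your Guildfile actually defines):

guild run model:train model_preprocess_op=`guild select --operation model:preprocess+`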

@garrett

Sorry for the late reply. Just completed a coast-to-coast move! Will respond in the GitHub issue.

@garrett

In relation to this, are there any plans for a “join/merge” operation? I am considering something like this:

guild join <LIST_OF_RUN_IDS>
guild join --tag <MY_TAG>
guild join --label <MY_LABEL>

Each of these would create a new experiment with links to (or copies of) the joined runs.

This would make it easy to, for example, add a new run to a set of runs used for some kind of summary operation, or, in the case of this topic, process more data after the first batch run and add it to the same batch run.

Yes, this is what I've been casually referring to as "summary ops" for quite a while now (with little progress, alas!)

I wrote up my thinking here: Summary operations.

Please feel free to comment there — your input on the feature is most welcome!