Distributing runs on a multi-GPU machine

I’m trying to distribute many runs across a multi-GPU system so that the experiments finish in the minimum total running time.

I was wondering how Guild could help with this.

You need to do a bit of setup for this yourself. We’re working on streamlining the interface for this functionality, but in the meantime, this should get you what you need:

This is my approach using queues on an 8-GPU machine.

First, I set up a queue for each GPU. This only needs to be done once.

for i in {0..7}
do
  # start a background queue that runs staged jobs on GPU $i
  guild run --background --yes queue gpus=$i
done
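
If you don’t want to hard-code the GPU count, here is a minimal variant of the same loop that derives it from nvidia-smi (this assumes nvidia-smi is on your PATH):

NUM_GPUS=$(nvidia-smi -L | wc -l)   # nvidia-smi -L prints one line per GPU
for i in $(seq 0 $((NUM_GPUS - 1)))
do
  guild run --background --yes queue gpus=$i
done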

I can check that my queues are running with

guild runs -Fo queue
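
If other runs are listed as well, you can narrow this to queues that are still running. I believe the --running status filter works here, but double-check guild runs --help for your Guild version:

guild runs --running -Fo queue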

To submit jobs, I use, e.g.,

guild run --stage-trials --quiet train.py learning_rate='logspace[-3:0:4]' batch_size='[1,2,5,10,20]'

The queues automatically pick up the staged trials and run them, one per GPU at a time.
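
When all the trials are finished, the queues keep running in the background. This is how I’d shut them down; I’m assuming guild stop accepts the same -Fo operation filter as guild runs, so check guild stop --help if it complains:

guild stop -Fo queue --yes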
