Distributing runs on a multi-GPU machine

I’m trying to distribute many runs across a multi-GPU system so that the experiments finish in the minimum total running time.

I was wondering how Guild could help with this.

You need to do a bit of setup for this yourself. We’re working on streamlining the interface for this functionality, but in the meantime, this should get you what you need:

This is my approach using queues on an 8-GPU machine.

First, I set up a queue for each GPU. This only needs to be done once.

for i in {0..7}
do
  # start a background queue that runs staged jobs on GPU $i
  guild run --background --yes queue gpus=$i
done
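
If you don’t want to hard-code the GPU count, here is a minimal variant of the same loop that derives it from nvidia-smi (this assumes nvidia-smi is on your PATH):

NUM_GPUS=$(nvidia-smi -L | wc -l)   # nvidia-smi -L prints one line per GPU
for i in $(seq 0 $((NUM_GPUS - 1)))
do
  guild run --background --yes queue gpus=$i
done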

I can check that my queues are running with

guild runs -Fo queue
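
If other runs are listed as well, you can narrow this to queues that are still running. I believe the --running status filter works here, but double-check guild runs --help for your Guild version:

guild runs --running -Fo queue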

To submit jobs, I use, e.g.,

guild run --stage-trials --quiet train.py learning_rate='logspace[-3:0:4]' batch_size='[1,2,5,10,20]'

The queues automatically pick up the staged trials and run them, one per GPU at a time.
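
When all the trials are finished, the queues keep running in the background. This is how I’d shut them down; I’m assuming guild stop accepts the same -Fo operation filter as guild runs, so check guild stop --help if it complains:

guild stop -Fo queue --yes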
