Hi,
It’s a fresh feature and most likely I’m doing something wrong, but I cannot get the Dask scheduler to work properly on a remote with 4 GPUs.
This is what I do, and apart from the remote part it is basically a copy of the steps in the How To guide:
- Connect to the remote. I have successfully staged runs on the remote, run them directly, and used multiple GPUs by assigning runs to GPUs manually with the --gpus flag, so the remote itself works correctly.
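For reference, the manual assignment that works looks roughly like this (operation and remote names from my setup; the GPU indices are just examples):

```shell
# One run per GPU, assigned by hand -- this works fine on the remote
guild run TABL:train window=100 --remote cerberus --gpus 0 -y
guild run TABL:train window=200 --remote cerberus --gpus 1 -y
```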
- Start the Dask scheduler on the remote.
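For completeness, I start the scheduler roughly as in the How To guide (workers set to 4 to match the GPUs; exact flags may differ in my shell history):

```shell
# Built-in Dask scheduler operation, run on the remote in the background
guild run dask:scheduler workers=4 --remote cerberus --background -y
```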
- Stage trials on the remote, let's say 4 with different parameters, using a single command:
  guild run TABL:train window=[100,200,300,400] --remote cerberus --stage-trials
- The trials are sent to the remote and the scheduler starts them. With workers set to 4, it correctly starts 4 processes (loading data, etc.).
And this is where something goes wrong. After the pre-processing stage, the scheduler creates 4 models concurrently and places all 4 models and training processes on all 4 GPUs (as the screenshot shows, every GPU has memory allocated), but then it begins training on only 1 GPU.
Expected
I would expect the scheduler to assign the trials to the available GPUs and train them concurrently. Furthermore, when the number of trials is larger than the number of GPUs, I would expect the scheduler to automatically start a pending operation as soon as a GPU becomes free.
Did I understand correctly what the scheduler is capable of? Or is there perhaps a manual step somewhere that I missed?
Thanks in advance.