Dask scheduler not using multiple GPUs on remote

It’s a fresh feature and most likely I’m doing something wrong, but I cannot get the Dask scheduler to work properly on a remote with 4 GPUs.

This is what I do, and apart from the remote part it is basically a copy of the steps in the How To guide:

  1. Connect to remote. I have successfully staged runs on the remote, run them directly, and also used multiple GPUs by assigning runs to GPUs manually using the --gpus flag. So the remote works correctly.

  2. Start the Dask scheduler on the remote.

  3. Then I stage trials on remote, let’s say 4, with different parameters, using a single command.

     guild run TABL:train window=[100,200,300,400] --remote cerberus --stage-trials

  4. The trials are sent to the remote, and the scheduler starts them. If workers is set to 4, it correctly starts 4 processes that load the data etc.

And this is where something goes wrong. After finishing the pre-processing stage and creating 4 models concurrently, the scheduler places all 4 models and training processes on all 4 GPUs (which can be seen in the screenshot: all GPUs have allocated memory). It then begins training on only 1 GPU.

I would expect the scheduler to assign the trials to available GPUs and train them concurrently. Furthermore, when the number of trials is greater than the number of GPUs, I would expect the scheduler to automatically run the next pending operation when a GPU becomes free.

Did I understand what the scheduler is capable of correctly? Is there maybe some manual step somewhere that I missed?

Thanks in advance.

By default the scheduler will simply run the staged runs concurrently, up to the number of workers specified. It doesn’t assign runs to GPUs. This is a good feature idea, but the initial pass doesn’t do this.

Instead, you need to assign runs to GPUs explicitly. This uses Dask resources to limit runs, as well as configuring each run for the target GPU.
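
As a rough illustration of explicit assignment, here is a minimal Python sketch that pairs each trial value with a GPU index, round-robin, and prints one command per trial instead of a single batch command. The `build_commands` helper is hypothetical. The `--gpus` and `--remote` flags and the `TABL:train` operation come from the question above; `--stage` is the standard run option for staging. The Dask resource limits described in the guide are what stop two runs from sharing a GPU, and they are deliberately not reproduced here, so take their exact syntax from the guide.

```python
import shlex
from itertools import cycle


def build_commands(operation, flag_name, values, gpu_ids, remote=None, stage=True):
    """Return one `guild run` command per trial value, assigning GPUs round-robin.

    Hypothetical helper for illustration only. It does not add the Dask
    resource limits from the guide; add those as the guide describes.
    """
    commands = []
    for value, gpu in zip(values, cycle(gpu_ids)):
        args = ["guild", "run", operation, f"{flag_name}={value}", "--gpus", str(gpu)]
        if remote:
            args += ["--remote", remote]
        if stage:
            args.append("--stage")
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in build_commands(
        "TABL:train", "window", [100, 200, 300, 400], [0, 1, 2, 3], remote="cerberus"
    ):
        print(cmd)
```

With more trials than GPUs, the GPU indices simply wrap around, so the fifth trial would again target GPU 0.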

The real target is to support this with just the run command—so you don’t need to mess with schedulers. But this first pass introduced concurrency with Dask distributed and so exposes the resource based scheduling that Dask offers.
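
For anyone unfamiliar with resource-based scheduling, the idea can be sketched in plain Python without Dask. This is not Dask's or Guild's implementation, only a toy model under stated assumptions: each GPU is a token in a queue, a trial must take a token before it runs, and it hands the token back when it finishes so the next pending trial can start. The `run_trials` and `fake_train` names are hypothetical; a real `run_fn` would launch the training process restricted to the given GPU.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def run_trials(trials, gpu_ids, run_fn):
    """Run trials concurrently with at most one trial per GPU at a time.

    Returns a list of (trial, gpu, result) tuples in the order of `trials`.
    """
    free_gpus = queue.Queue()
    for gpu in gpu_ids:
        free_gpus.put(gpu)

    def worker(trial):
        gpu = free_gpus.get()  # blocks until some GPU token is free
        try:
            return trial, gpu, run_fn(trial, gpu)
        finally:
            free_gpus.put(gpu)  # release the GPU for the next pending trial

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, trials))


def fake_train(trial, gpu):
    """Stand-in for a training run (assumption: real code would start a process)."""
    time.sleep(0.01)
    return f"window={trial} trained on gpu {gpu}"


if __name__ == "__main__":
    for trial, gpu, result in run_trials([100, 200, 300, 400, 500, 600], [0, 1, 2, 3], fake_train):
        print(result)
```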

If you have any questions about the steps in Scheduling Runs on Specific GPUs, please feel free to follow up here. We’ll look for ways to clean this up, either via better docs or tweaks to the code.

Ok, thanks for the clarification, I’ll continue assigning the runs manually :slight_smile: