Hi,
It’s a fresh feature and most likely I’m doing something wrong, but I cannot get the Dask scheduler to work properly on a remote with 4 GPUs.
This is what I do, and apart from the remote part it is basically a copy of the steps in the How To guide:
- Connect to the remote. I have successfully staged runs on the remote, run them directly, and used multiple GPUs by assigning runs to GPUs manually with the --gpus flag, so the remote itself works correctly.
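For reference, the manual assignment that works looks roughly like this (operation and remote names from my setup; the GPU indices are just examples):

```shell
# One run per GPU, assigned by hand -- this works fine on the remote
guild run TABL:train window=100 --remote cerberus --gpus 0 -y
guild run TABL:train window=200 --remote cerberus --gpus 1 -y
```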
- Start the Dask scheduler on the remote.
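For completeness, I start the scheduler roughly as in the How To guide (workers set to 4 to match the GPUs; exact flags may differ in my shell history):

```shell
# Built-in Dask scheduler operation, run on the remote in the background
guild run dask:scheduler workers=4 --remote cerberus --background -y
```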
- Stage trials on the remote, let's say 4 with different parameters, using a single command:
  guild run TABL:train window=[100,200,300,400] --remote cerberus --stage-trials
- The trials are sent to the remote and the scheduler starts them. With workers set to 4, it correctly starts 4 processes (loading data, etc.).
And this is where something goes wrong. After the pre-processing stage, the scheduler creates 4 models concurrently and places all 4 models and training processes on all 4 GPUs (as the screenshot shows, every GPU has memory allocated), but then it begins training on only 1 GPU.
Expected
I would expect the scheduler to assign the trials to the available GPUs and train them concurrently. Furthermore, when the number of trials is larger than the number of GPUs, I would expect the scheduler to automatically start a pending operation as soon as a GPU becomes free.
Did I understand correctly what the scheduler is capable of? Or is there perhaps a manual step somewhere that I missed?
Thanks in advance.