Is it possible to run an optimizer over more than one remote machine?
I understand we can issue something like guild run train learning-rate[0.01,0.1] batch-size=[10,100] --remote remote-machine-1
and have the all optimizer run in a single remote machine. But, let’s say I have 10 remote machines, I would like the same optimizer to be running over the 10 machines with different parameters in each of them.
Unfortunately Guild doesn’t support distributed runs across machines. As of today, this requires manual scripting:
- Break up your training operations into runs-per-machine
- Start the runs remotely on each applicable machine
- Use
guild pull
to sync the work performed on each machine to your local machine or another system for consolidation (sometimes referred to as a sink in this case)
While this is a bit of a pain, it’s not terribly complicated. We are looking at integration with distributed systems like Kubernetes (likely via Kubeflow), Dask, Ray, Airflow, etc. But those schemes all bring certain complexity with them.
I’ll reference this topic though in our designs for this feature. I agree it would be very handy to support something like this:
guild run <op> --remote <some cluster spec>
Sorry about that!
1 Like
Thank you for the comprehensive response Garrett, I will try that.
1 Like
I would also like this. And it seems like it should be possible to implement in a way that is independent of queuing system.
I am happy to do some manual scripting, but it doesn’t seem easily possible to distribute work across a fixed number of workers without using the filesystem as a queue. In principle, guild run queue
would be enough it had some additional flags to ensure that workers don’t grab the same work item. For example k8s now has a sequence job that gives each worker a number. It should be possible to make a guild run queue --rank i --num-workers 10
grab independent tasks (e.g. based on the hash of the run).
Many systems (e.g. slurm array jobs, k8s jobs, etc) can support this rudimentary form of work distribution.
Hi @nbren12 - sorry for the late reply!
Guild uses a file locking scheme to address contentions across queues. Locks are written to $GUILD_HOME/locks
— if this location is shared across processes (e.g. running on different nodes) queues should not attempt to start the same queued run. If they do that’s a bug. (This behavior is under test but for a locally mounted file system on a single node.)
Queues support a scheme for handling runs targeting a particular GPU, which lets you serialize runs per GPU. This could be used as a workaround per your request, but it has the side effect of setting CUDA_VISIBLE_DEVICES
per run, which is not something I’m comfortable recommending.
If I understand your request, Guild queues should support some method of selecting runs that match some criteria, otherwise they leave the staged run alone. E.g. tags or labels are logical candidates for associating a run with a queue.
Sorry again for getting back to you so late!