Hyperparameter optimizer distributed over several remote machines

Is it possible to run an optimizer over more than one remote machine?

I understand we can issue something like guild run train learning-rate=[0.01,0.1] batch-size=[10,100] --remote remote-machine-1 and have the whole optimizer run on a single remote machine. But say I have 10 remote machines; I would like the same optimizer to run across all 10, with different parameters on each of them.

Unfortunately, Guild doesn’t support distributed runs across machines. As of today, this requires manual scripting (sketched below):

  • Break up your training operations into runs-per-machine
  • Start the runs remotely on each applicable machine
  • Use guild pull to sync the work performed on each machine to your local machine or another system for consolidation (sometimes referred to as a sink in this case)
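
For example, here is a minimal sketch with two remotes. The remote names remote-1 and remote-2 and the way learning-rate is split across them are just illustrative, and assume those remotes are already defined in your Guild remotes configuration:

guild run train learning-rate=0.01 batch-size=[10,100] --remote remote-1 -y
guild run train learning-rate=0.1 batch-size=[10,100] --remote remote-2 -y

Then, once the runs finish, pull them back to the local (sink) machine and compare everything in one place:

guild pull remote-1
guild pull remote-2
guild compare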

While this is a bit of a pain, it’s not terribly complicated. We are looking at integration with distributed systems like Kubernetes (likely via Kubeflow), Dask, Ray, Airflow, etc., but those schemes all bring a certain amount of complexity with them.

I’ll reference this topic in our designs for this feature, though. I agree it would be very handy to support something like this:

guild run <op> --remote <some cluster spec>

Sorry about that!


Thank you for the comprehensive response, Garrett. I will try that.
