Distributed remote workflow question

Hi Garret and fellow users of Guild AI,

I’m a prospective user, and I have a question that relates to Exporting guild runs from remote server.

My current workflow looks something like this:

  1. Prep data on machine A
  2. [Map step] Download the prepped data on each of the virtual machines 1, 2, …, n, and do embarrassingly parallel things with it. Since the VMs here are Google Cloud TPU VMs, this is done by calling the gcloud ssh client and passing the appropriate start and end indices of the data to a Python script on each VM.
  3. [Reduce step] Concat the processed data and maybe do some further processing on machine B, where machine B need not even be in the same network as machine A.
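For concreteness, the index bookkeeping in the map step could look roughly like the sketch below. It is only an illustration: the function name shard_ranges and the choice of half-open (start, end) ranges are assumptions for this post, not anything Guild AI or gcloud provides.

```python
def shard_ranges(total_rows, num_workers):
    """Split range(total_rows) into num_workers contiguous half-open (start, end) ranges.

    Earlier workers receive one extra row when the split is uneven.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(total_rows, num_workers)
    ranges = []
    start = 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # One (start, end) pair per VM; each pair would be passed to the processing script.
    for vm_index, (start, end) in enumerate(shard_ranges(10, 3), start=1):
        print(f"VM {vm_index}: rows [{start}, {end})")
```

Each pair would then be handed to the processing script on the corresponding VM, which is the part I would like to stop tracking by hand.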

I’m wondering what the best way is to incorporate Guild AI into this presumably not-uncommon workflow. In particular, I would like to minimize the bookkeeping and automate as much as possible, since I’ll want to repeat all of this many times with different rows of the source data. (And note, again, that the machine on which the reduction step is done may not be the same as any of the machines in the previous steps; indeed, it may not even be on the same network.)

From the docs, and from looking at the forums, it looks like one way would be to guild push the processed output to an external storage location such as a GCS bucket, and then guild pull from the GCS bucket (or maybe guild copy?) on the reduction step. To make things more automated, the reduction step could be automatically triggered when the mapping stage successfully finishes.

And I guess another way might be to use something like Ray to orchestrate the whole thing from one VM, in which case the guild part of the picture will be a lot simpler.

Does this sound right to you? Apologies in advance for the long post; I’ll be very happy to share my code / write up a mini-tutorial if I manage to get this working.

Sorry for the late reply here!

I’m not sure what the integration point to Ray would be, but that could be an approach. You’re moving the init of the environment and orchestration of the tasks over to that framework.

If you’re using EC2, you can use Guild’s support for starting server instances in that cloud environment. There was a recent contribution that, I believe, adds support for Azure. Adding support for Google Cloud should not be terribly difficult, but its absence would be a strong point of friction if you need to run in that cloud.

I’d recommend a bash or Python script to orchestrate the Guild commands.

The gist is this:

  • Use Guild to start server instances (e.g. in EC2) using guild remote start: N for the data processing nodes and 1 for the reduction node. These would all have to be explicitly defined in the user config as Guild does not support clusters at this time.
  • Once each data processing node is started, run remote commands from a local project to kick off the data processing tasks. Each operation would require the index offsets per node along with any other hyperparameters used by the operations.
  • Poll the data processing nodes for status and pull results down using guild pull. This should probably be done by the reduction node, so it would be run as a remote operation. I.e. this polling, pulling, and processing operation is defined in your Guild file and run remotely on the reduction host.
  • Once the tasks are completed, shut down the servers using guild remote stop.
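
As a rough sketch of what such a Python orchestration script might look like, the function below only builds the sequence of Guild commands and leaves running them to you. The remote names, the operation names process and reduce, and the flag names start and end are all hypothetical; they would have to match your own user config and Guild file. It also assumes the reduce operation itself does the polling and pulling described above.

```python
import shlex


def build_guild_commands(workers, sink, ranges, map_op="process", reduce_op="reduce"):
    """Return the Guild CLI invocations, in order, as argv lists.

    workers/sink are remote names assumed to be defined in the user config.
    map_op, reduce_op, and the start/end flags are hypothetical names.
    """
    if len(workers) != len(ranges):
        raise ValueError("need exactly one index range per worker remote")
    cmds = []
    # Start the N data processing nodes and the single reduction node.
    for remote in [*workers, sink]:
        cmds.append(["guild", "remote", "start", remote])
    # Kick off one map operation per worker with its index offsets.
    for remote, (start, end) in zip(workers, ranges):
        cmds.append(
            ["guild", "run", "--remote", remote, "-y", map_op, f"start={start}", f"end={end}"]
        )
    # The reduce operation is assumed to poll the workers and pull their results.
    cmds.append(["guild", "run", "--remote", sink, "-y", reduce_op])
    # Shut everything down once the work is done.
    for remote in [*workers, sink]:
        cmds.append(["guild", "remote", "stop", remote])
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    plan = build_guild_commands(["worker-1", "worker-2"], "sink", [(0, 500), (500, 1000)])
    for cmd in plan:
        print(shlex.join(cmd))
```

Printing the plan first makes it easy to check the commands against guild run --help and guild remote --help before executing each one with subprocess.run. Whether a remote run blocks until it finishes is not something this sketch relies on, so the script would need its own waiting logic between the map and stop steps.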

This is a very common workflow, where servers process work and a “sink” server is used to collect and process results. You can orchestrate this any number of ways but the pattern is the same. You could easily enough bypass Guild’s remote features and orchestrate this completely with calls to the cloud API/CLI to start and stop the servers, ssh commands to run remote commands, and rsync or scp commands to move files around. If you use a framework for orchestration, something equivalent will happen.
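
For the Guild-free variant, a minimal sketch of the ssh and rsync side might look like this. The host names, the script name process.py, its --start/--end arguments, and the directory layout are placeholders for illustration; starting and stopping the servers through the cloud API/CLI is left out.

```python
import shlex


def ssh_map_command(host, start, end, script="process.py"):
    """Build an ssh invocation that runs the (hypothetical) processing script on host."""
    remote_cmd = f"python {shlex.quote(script)} --start {start} --end {end}"
    return ["ssh", host, remote_cmd]


def rsync_collect_command(host, remote_dir, local_dir):
    """Build an rsync invocation that copies a worker's output into a per-host local directory."""
    source = f"{host}:{remote_dir.rstrip('/')}/"
    dest = f"{local_dir.rstrip('/')}/{host}/"
    return ["rsync", "-az", source, dest]


if __name__ == "__main__":
    # Dry run on the sink: print what would be executed for two placeholder hosts.
    for host, (start, end) in zip(["worker-1", "worker-2"], [(0, 500), (500, 1000)]):
        print(shlex.join(ssh_map_command(host, start, end)))
        print(shlex.join(rsync_collect_command(host, "output", "collected")))
```

The shape is the same as with Guild remotes: one command per worker to start the work, and one per worker on the sink to bring results back.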

I realize this is a very late response and you’re probably well past your starting point—my apologies again for getting back so late! If you have any additional questions, please feel free to post. I’ll get back to you much faster!

Thanks for this very informative post!

I did end up just doing this with another experiment tracking solution, though I think I prefer (what I gather is) GuildAI’s philosophy and approach; for example, the other experiment tracking solution doesn’t seem to copy all the code files (at least not by default), and adding the requisite boilerplate code felt like a pain.

If GuildAI doesn’t already work well with distributed gcloud TPUs, I think it might be worth working on that. Right now gcloud TPUs seem to be what people are using for distributed training of massive models. But then again, maybe it wouldn’t make sense to spend too much time on features that only those working with massive models will use!
