How do you use GuildAI with Slurm/remote jobs?

I have little to add to @teracamo’s replies, which I think cover the topics quite well.

To reiterate some key points (already covered above but worth underscoring):

You don’t need to modify guild.yml to change flag values. Guild has deep support for driving experiments without changes to guild.yml or any other project file. Having to edit your code by hand just to run a variant of an experiment (e.g. a different learning rate) slows you down a lot. Guild also supports hyperparameter optimization runs, which suggest flag values for you based on previous runs. This would be impossibly slow if you had to modify files along the way.
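As a rough sketch of what that looks like on the command line: the operation name `train` and the flags `lr` and `batch_size` below are hypothetical placeholders for whatever your guild.yml defines. The commands are wrapped in shell functions only so nothing runs until you call one.

```shell
# Hypothetical operation "train" with flags "lr" and "batch_size".

# One run, with flag values given on the command line.
run_single() {
  guild run train lr=0.01 batch_size=64 -y
}

# A batch: one trial per listed value (quoted so the shell leaves the brackets alone).
run_grid() {
  guild run train 'lr=[0.001,0.01,0.1]' -y
}

# Let an optimizer suggest values from a range, based on earlier trials.
run_optimized() {
  guild run train 'lr=[0.0001:0.1]' --optimizer gp --max-trials 10 -y
}
```

None of these touch guild.yml; the values used are recorded with each run.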

Guild saves the flag values that are actually used for a run separately from your config files, so you always know what actually ran. These can be viewed in various ways (e.g. guild runs info, guild view, guild compare).

If you want to version your experiments in source control, you can certainly modify guild.yml for each run. This is the pattern required by experiment tracking tools that rely on version control (e.g. git) to track experiments. Guild does not require this. Guild lets you separate experimentation, where you don’t know the outcome, from source code revisions, where you checkpoint a project in a stable state. If you want to track flag values that way, consider defining your flags in config files that are separate from guild.yml. This example illustrates the three formats that Guild supports for config-file-based flags (YAML, JSON, and INI).

If you are using Slurm to defer execution via a latent batch operation, consider staging your runs like this:

guild run <operation and flag vals> --stage -y

If you’re running a batch (e.g. by specifying multiple flag values as per @davzaman’s example above) you can use --stage-trials instead of --stage.

Then, your Slurm command could use a queue or Dask scheduler to start your staged runs. Set the run-once flag so that the queue/scheduler starts all your staged runs, waits for them to finish, and then exits.

guild run queue run-once=yes -y

Or, to run in parallel with Dask:

guild run dask:scheduler run-once=yes -y

This approach lets you fully stage operations so that they no longer depend on your project. You can then freely modify anything you want without worrying about the changes accidentally impacting staged runs.
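Putting the staging and queue steps together, a minimal sketch might look like the following. The operation name `train`, its `lr` flag, and the job name `guild-queue` are hypothetical, and the sketch assumes the Slurm job sees the same Guild environment (and therefore the same staged runs) as the shell where you stage them, for example via a shared filesystem. Adjust the sbatch options to your cluster.

```shell
# Hypothetical operation "train" and flag "lr"; job name is a placeholder.
stage_and_submit() {
  # Stage one trial per learning rate without running anything yet.
  guild run train 'lr=[0.001,0.01,0.1]' --stage-trials -y

  # Submit a single Slurm job that starts the staged runs, waits for them
  # to finish, and then exits.
  sbatch --job-name=guild-queue --wrap 'guild run queue run-once=yes -y'
}
```

Calling `stage_and_submit` from your project directory stages the trials and hands them to Slurm; swap `queue` for `dask:scheduler` in the wrapped command if you want the staged runs started in parallel.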
