How do you use GuildAI with Slurm/remote jobs?


I am contemplating switching from a custom-built experiment tracking system to Guild, but I am unsure whether it’s worth it. I run all my experiments on a remote Slurm cluster.

Right now, I create a new folder for each batch of experiments, containing one configuration file per experiment. I have to create a new file for each experiment because I don’t know when the experiment will start: the job always uses the most recent version of a configuration file when it finally starts. I think I will have to do the same with GuildAI.

What I would like to have is only one config file which I can change, start a job, change it again and then start another job.

What’s your setup with GuildAI and slurm?

I think Guild pretty much does what you are describing automatically. Each run you submit gets its own separate folder, regardless of whether your config file changes.

My setup is pretty simple: I install Guild on the head node, write the guild.yml project file, and submit the job to the cluster with srun [resources needed] guild run [operation] -y -q [other options] &, which then runs quietly in the background (remember to use srun in a persistent shell session, such as tmux).

Guild can also spare you the trouble of modifying the config file if you specify the hyperparameters in guild.yml correctly. You can either use Guild’s built-in list syntax, say you want to test and tune the learning rate lr:

srun -c 12 --gres=gpu:4 --mem=100G --qos=default guild run -q -y train lr=[1E-4,1E-5,1E-6] &

or you can use a bash for loop to queue multiple Slurm jobs if you think the Slurm walltime limit can’t accommodate that many runs in one job:

for x in 1E-4 1E-5 1E-6; do
  srun -c 12 --gres=gpu:4 --mem=100G --qos=default guild run -q -y train lr=${x} &
done

So far Guild has worked almost perfectly with Slurm, except that it can’t report the running status, because your jobs run on a compute node instead of the head node. Other than that, it has been working pretty well for me.


Why don’t you submit jobs with sbatch? That way you don’t need to keep the connection open.

When you use a guild.yml file, the issue I was talking about remains, since Slurm uses the most recent version of your guild.yml file when the job starts, not the version from when it was submitted.

Well, I used srun because I wanted to spare myself the trouble of deleting the stdout and stderr files. It’s fine to use sbatch; in fact, sometimes you need sbatch because srun doesn’t respect .bashrc.

For your second question, may I ask what kind of changes you were making to the guild.yml file?
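For reference, here is a rough sketch of how the srun loop above might look with sbatch. It assumes the same resource flags as my srun example and an operation named train with a flag lr; the submit_run helper and the DRY_RUN switch are just names I made up for illustration. Sending --output and --error to /dev/null is one way to avoid the leftover stdout/stderr files, and since the flag value is part of the submitted command, it should not be affected by later edits to the defaults in guild.yml (the project code itself is still read when the job starts).

```shell
#!/usr/bin/env bash
# Sketch: submit one Guild run per learning rate with sbatch instead of srun.
# Resource flags are copied from the srun example; adjust them for your cluster.
# DRY_RUN defaults to 1, so the script only prints the commands it would run.
DRY_RUN="${DRY_RUN:-1}"

# submit_run LR: build (and optionally submit) one sbatch job for a single lr.
# "train" and "lr" are assumed names taken from the examples in this thread.
submit_run() {
  local lr="$1"
  local cmd=(sbatch -c 12 --gres=gpu:4 --mem=100G --qos=default
             --output=/dev/null --error=/dev/null
             --wrap "guild run -q -y train lr=${lr}")
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of submitting it (quoting of --wrap is lost here).
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

for lr in 1E-4 1E-5 1E-6; do
  submit_run "$lr"
done
```

Run it as is to check the commands, then set DRY_RUN=0 to actually submit. No persistent shell session is needed once the jobs are queued.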

Let’s say I want to start two jobs:

  1. job: learning_rate: 0.1
  2. job: learning_rate: 0.01

When the learning_rate is configured in guild.yml and both jobs are not started immediately, both will end up with the same learning_rate value.

So I guess I would need to create two separate guild.yml files.

In this case, I think you do not need to modify the config file if you put these parameters under the flags subsection, so your yml looks like this:

train:
    description: Train a model.
    main: main
    flags:
        learning_rate: 0.1 # values here are overridden if you specify them on the command line.
    output-scalars:
        loss/loss: 'EpochLoss: (\value)'
        perf/accuracy: 'ACC: (\value)'
        loss/validation_loss: 'VAL: (\value)'
    sourcecode:
        root: ../
        select:
            - '*.py'
            - '*.pyx'
            - '*.txt'
            - '*.so'
    requires:
        - file: '../Data'

Without changing guild.yml, you can simply do:

srun ... guild run -q -y learning_rate=0.1 &
srun ... guild run -q -y learning_rate=0.01 &
srun ... guild run -q -y learning_rate=0.001 &

This submits three jobs that use the learning rates 0.1, 0.01 and 0.001 respectively.
This is the intended use of Guild: the values you specify on the command line override whatever values you wrote in the config file. Values in the config file only act as the defaults for the flags.

If you feel like running them in the same slurm job, you can use the square bracket list syntax:

srun ... guild run -q -y learning_rate=[0.1,0.01,0.001] &

Basically, you don’t have to create a new guild.yml to change anything under the flags section.


I have little to add to @teracamo’s replies, which I think cover the topics quite well.

To reiterate some key points (already covered above but worth underscoring):

You don’t need to modify guild.yml to change flag values. There is deep functionality in Guild to let you drive experiments without changes to guild.yml or any other project file. Having to manually modify your code each time just to run an alternative version of an experiment (e.g. a different learning rate) slows you down a lot. Guild also supports hyperparameter optimization runs, which suggest flag values for you based on previous runs. This would be impossibly slow if you had to modify files along the way.

Guild saves the flag values that are actually used for a run separately from your config files, so you always know what actually ran. These can be viewed in various ways (e.g. guild runs info, in guild view, in guild compare, etc.)

If you want to version your experiments in source control, you can certainly modify guild.yml for each run. This is the pattern required by experiment tracking tools that rely on version control (e.g. git) to track experiments. Guild does not require this. Guild lets you separate experimentation, where you don’t know the outcome, from source code revisions, where you checkpoint a project in a stable state. If you want to track flag values that way, consider defining your flags in config files that are separate from guild.yml. This example illustrates the three formats that Guild supports for config file based flags (YAML, JSON, and INI).
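As a small sketch of checking what actually ran, the commands below list runs, show details for the latest run (including its flag values), and open the comparison view. The run_or_print wrapper and DRY_RUN switch are hypothetical helpers so the snippet can be tried on a machine without Guild; note that guild compare is interactive by default.

```shell
#!/usr/bin/env bash
# Sketch: inspect the flag values Guild recorded for your runs.
# DRY_RUN defaults to 1, so the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# run_or_print CMD...: print the command in dry-run mode, otherwise execute it.
run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_or_print guild runs        # list runs
run_or_print guild runs info   # details for the latest run, including its flags
run_or_print guild compare     # compare flags and scalars across runs
```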

In your use of slurm, if you are deferring execution via a latent batch operation, consider staging your runs like this:

guild run <operation and flag vals> --stage -y

If you’re running a batch (e.g. by specifying multiple flag values as per @davzaman’s example above) you can use --stage-trials instead of --stage.

Then, your slurm command could use a queue or Dask scheduler to start your staged runs. Use the run-once flag to cause the queue/scheduler to run once, which will start all your staged runs, wait for them to finish, and then exit.

guild run queue run-once=yes -y

Or, to run in parallel with Dask:

guild run dask:scheduler run-once=yes -y

This approach lets you fully stage operations so that they are not dependent on your project. This way you can freely modify anything you want without worrying about the changes accidentally impacting staged runs.
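Putting the pieces together, a minimal sketch of the stage-then-queue workflow under Slurm might look like the following. It assumes an operation named train with a learning_rate flag (as in the examples earlier in this thread) and that the Guild home directory holding the staged runs is on a filesystem the compute nodes can see. The helper names and the DRY_RUN switch are hypothetical, and the sbatch options are placeholders for whatever resources you need.

```shell
#!/usr/bin/env bash
# Sketch: stage runs now, then let one Slurm job start them later.
# DRY_RUN defaults to 1, so the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# run_or_print CMD...: print the command in dry-run mode, otherwise execute it.
run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# stage_run LR: stage one run with its flag value fixed at staging time.
stage_run() {
  run_or_print guild run train "learning_rate=$1" --stage -y
}

# submit_queue: submit a Slurm job that starts a queue, runs all staged
# runs once, waits for them to finish, and then exits.
submit_queue() {
  run_or_print sbatch --output=/dev/null --wrap "guild run queue run-once=yes -y"
}

stage_run 0.1
stage_run 0.01
submit_queue
```

After staging, you can edit guild.yml or your code for the next experiment while the job waits in the Slurm queue.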