Guild Stage-Trials Rerun on Error

ratheile · April 16, 2024, 8:00am

For my academic project, I use guild.ai as a lightweight job scheduler on our GPU cluster node.
Following some of the tutorials on my.guild.ai, I am currently using several queues to split tasks to different GPUs on the node.

guild run --background --yes queue gpus=i

then I simply submit tasks using guild run --stage-trials ....

(This works amazingly well for other users that need such a functionality!)

The problem: The cluster is automatically re-balancing loads, so the node is subject to restarts.
This would be fine because runs show up as “error” (as far as I understand due to PID mismatch on the restarted node, since the same file system is mounted again). However I am currently unable to reset runs from state “error” to “staged”, preventing me from using it in fully automated manner.

I tried to use the run restart functionality, but it seems incompatible with stage-trials.

Some examples of what I tried:

guild run --restart --stage <run>
guild run --proto --stage <run>
guild run --stage-trials `guild select -Se :`

PS: Thank you for this amazing project!

garrett · June 18, 2024, 7:17pm

Hello and my apologies for the very late reply! We’ve been working on Gage the past few months and have been very slow in replying here.

The command guild run --proto <run ID> --stage I believe is what you want.

I create a sample project and I think was able to simulate your situation.

The project simply runs this script:

# run.py
import time
time.sleep(99999)

In one terminal I start a queue:

guild run queue

In another terminal I staged the above run.py:

guild run run.py --stage

The queue picks up the staged run after a few second of waiting and runs it.

I forcefully killed the run.py process.

kill -9 `guild select -Fo run.py --attr pid`

In my cases this actually deleted the process lock and so the run shows terminated instead of error. But the net effect should be the same.

At this point, the system resembles what it’d look like after a GPU rebalance - the run was killed and let in an error/terminated state.

Note this just simulated the run termination - the queue is still running. In your case you of course would restart the queue if the GPU rebalance killed it too.

This command got the run going again:

guild run --proto <run ID> --stage

The --proto option tells Guild to start a new run using the specified run as a “prototype” - Guild copies the contents to another run with a new “staged” status. The queue then picked up this staged run and started it.

If you try this and it doesn’t work, please report back here with the details and we’ll troubleshoot further.

Sorry again for the late reply!

Topic		Replies	Views
Stage trials error Troubleshooting	3	550	January 15, 2022
Restarting multiple runs General	9	1470	July 28, 2020
Running jobs show as errors on cluster Troubleshooting	2	485	January 15, 2022
Distributing runs on a multi-gpu machine General	2	775	October 19, 2022
Restart behavior of Guild steps (pipelines) RFC	0	242	March 13, 2023

Guild Stage-Trials Rerun on Error

Related topics