For my academic project, I use guild.ai as a lightweight job scheduler on our GPU cluster node.
Following some of the tutorials on my.guild.ai, I am currently using several queues to split tasks to different GPUs on the node.
guild run --background --yes queue gpus=i
then I simply submit tasks using guild run --stage-trials ...
.
(This works amazingly well for other users that need such a functionality!)
The problem: The cluster is automatically re-balancing loads, so the node is subject to restarts.
This would be fine because runs show up as “error” (as far as I understand due to PID mismatch on the restarted node, since the same file system is mounted again). However I am currently unable to reset runs from state “error” to “staged”, preventing me from using it in fully automated manner.
I tried to use the run restart functionality, but it seems incompatible with stage-trials.
Some examples of what I tried:
guild run --restart --stage <run>
guild run --proto --stage <run>
guild run --stage-trials `guild select -Se :`
PS: Thank you for this amazing project!