Running jobs show as errors on cluster

Not sure if this is a bug or a feature request, but here is the issue: I stage a bunch of jobs on computer 1, which is part of a cluster with a shared file system. When I call guild runs on any computer in the cluster, it properly shows all the staged jobs. If I start a queue on computer 1, it promptly launches the first staged job:

If I then log into computer 2, the running job shows as an error:

Why would it show as an error instead of just “running”?

Guild determines if a run is “running” by reading the run’s pid file and checking whether that pid is alive on the local system. There are two major problems here:

  • A run that terminates in a way that leaves its pid file behind (e.g. power loss, SIGKILL, etc.) can later show up as “running” if the OS recycles that pid for an unrelated process
  • When status is checked from another machine on a shared file system, the checking process consults its own process table, which contains a different set of pids
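To illustrate why the second case fails, here is a minimal sketch of that kind of local pid-liveness check (a hypothetical illustration, not Guild’s actual implementation):

```python
import os

def pid_is_running(pid: int) -> bool:
    """Return True if `pid` is alive on *this* machine."""
    try:
        # Signal 0 performs no action; it only checks whether the
        # pid exists and we are allowed to signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The pid exists but belongs to another user.
        return True
    return True
```

Because the check consults the local process table, a pid written to a pid file by computer 1 will almost never correspond to a live process on computer 2, so the run looks dead (and hence errored) from there, even though it is still running on computer 1.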

Guild needs to differentiate remote runs from local runs when checking this status. To get the actual, bona fide status, Guild needs an interface to the remote system, which it has, but using it currently requires a guild pull or guild sync operation before running guild runs.

We’ll take a look at this for either 0.7.5 or 0.8. Thanks for the report and sorry for the super long response time!

No worries, and thanks for the explanation!