Running jobs show as errors on cluster

Not sure if this is a bug or a feature request, but here is the issue: I stage a bunch of jobs on computer 1, which is part of a cluster with a shared file system. When I call guild runs on any computer in the cluster, it properly shows all the staged jobs. If I start a queue on computer 1, it promptly launches the first staged job:

If I then log into computer 2, the running job shows as an error:

Why would it show as an error instead of just “running”?

Guild determines if a run is “running” by reading the run’s pid file and checking whether that pid is alive on the local system. There are two major problems here:

  • A run that terminates in a way that leaves its pid file behind (e.g. power loss, SIGKILL, etc.) can later show up as “running” if the OS recycles that pid for an unrelated process
  • When status is checked from another machine on a shared file system, the checking process consults its own process table, which contains a different set of pids
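To illustrate why the second case fails, here is a minimal sketch of that kind of local pid-liveness check (a hypothetical illustration, not Guild’s actual implementation):

```python
import os

def pid_is_running(pid: int) -> bool:
    """Return True if `pid` is alive on *this* machine."""
    try:
        # Signal 0 performs no action; it only checks whether the
        # pid exists and we are allowed to signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The pid exists but belongs to another user.
        return True
    return True
```

Because the check consults the local process table, a pid written to a pid file by computer 1 will almost never correspond to a live process on computer 2, so the run looks dead (and hence errored) from there, even though it is still running on computer 1.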

Guild needs to differentiate remote runs from local runs when checking this status. To get the actual, bona fide status, Guild needs an interface to the remote system, which it has, but using it currently requires a guild pull or guild sync operation before running guild runs.

We’ll take a look at this for either 0.7.5 or 0.8. Thanks for the report and sorry for the super long response time!

No worries, and thanks for the explanation!