I’ve tested the --stage-trials feature with queues today, all worked great. But now I call guild run op--staged-trials, and check with guild runs, some runs are staged, others show errors
A minute later I try guild runs again, and all runs show as error without me having done anything. What could cause the staged trials to flip to errors all of a sudden? This always happens now.
I believe I was able to fix this. After running ps -u username, I found there were some zombie guild and python processes still running. After killing them, the staging immediately worked again. This may or may not be related to me using queue the day before. I did stop them with guild stop -Fo queue, but it may not have completely killed the processes.
Yeah that sounds like a queue that is still running — cleaning up the errant processes is the right solution there.
Guild uses runs to track these queue processes — this is why they’re run as runs and not some other Guild process, which would otherwise disappear from view when run. Runs are all represented on disk (the run dir) and tied to OS processes via pid and lock files. This is standard operating procedure for process orchestration.
Guild could use some housekeeping functions, for example, scanning for orphaned processes and providing a way to forceably stop them.