Stage trials error

Gilded · July 30, 2021, 1:38am

I’ve tested the --stage-trials feature with queues today, and it all worked great. But now when I call guild run op --stage-trials and check with guild runs, some runs show as staged and others show as error.

A minute later I try guild runs again, and all runs show as error without my having done anything. What could cause the staged trials to flip to error all of a sudden? This happens every time now.

Gilded · July 30, 2021, 5:29pm

I believe I was able to fix this. After running ps -u username, I found some zombie guild and python processes still running. After I killed them, staging immediately worked again. This may or may not be related to my using queues the day before. I did stop them with guild stop -Fo queue, but that may not have completely killed the processes.
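For anyone who wants to script the ps check above, here is a minimal sketch. It assumes a POSIX system with ps available and that leftover processes have "guild" somewhere in their command line. The function names parse_ps and list_guild_processes are made up for illustration and are not part of Guild.

```python
import getpass
import subprocess


def parse_ps(ps_output, needle="guild"):
    """Return (pid, command) pairs whose command line contains `needle`.

    Expects one process per line in the form "<pid> <command line>".
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip blank lines, header lines, and entries with no command.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, cmd = int(parts[0]), parts[1]
        if needle in cmd:
            matches.append((pid, cmd))
    return matches


def list_guild_processes(user=None):
    """List the given user's processes that mention "guild" (assumption:
    that substring is enough to identify leftover Guild processes)."""
    user = user or getpass.getuser()
    # "-o pid= -o args=" prints only pid and command line, with no header.
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ps(result.stdout)
```

Calling list_guild_processes() returns a list you can review before deciding what to kill. It only reports processes and never stops anything on its own.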

garrett · January 14, 2022, 4:58pm

Yeah that sounds like a queue that is still running — cleaning up the errant processes is the right solution there.

Guild uses runs to track these queue processes. That is why queues are started as runs rather than as some other kind of Guild process, which would otherwise disappear from view once started. Each run is represented on disk (the run dir) and tied to its OS process via pid and lock files. This is standard practice for process orchestration.
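To illustrate the general pid-file idea (not Guild's actual implementation), here is a minimal POSIX-only sketch that reads a pid from a file and checks whether that process still exists. The pid_file path is supplied by the caller and is hypothetical; where Guild stores its pid and lock files inside the run dir is not covered here.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # actually delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def read_pid(pid_file):
    """Read an integer pid from `pid_file`; return None if unreadable."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def run_process_state(pid_file):
    """Classify a pid file as "no-pid", "alive", or "stale"."""
    pid = read_pid(pid_file)
    if pid is None:
        return "no-pid"
    return "alive" if pid_is_alive(pid) else "stale"
```

A "stale" result means the file names a process that no longer exists. An "alive" result for a run you believed was stopped is the situation described in this thread, with the caveat that pids can be reused by unrelated processes.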

Guild could use some housekeeping functions, for example, scanning for orphaned processes and providing a way to forcibly stop them.

Gilded · January 15, 2022, 6:08pm

Yeah I think some sort of automatic housekeeping would be great as this seems to be a common issue. Thanks!
