Guild runs showing as error instead of pending

I’ve been using my guild.yml file with no issues so far. I’m not sure what happened, but suddenly my pending runs are showing up as errors (no logs, no error messages, but the environment somehow hasn’t been resolved either).
This behavior also shows up in guild runs:

[1:1609e5a4]   imputer   2021-01-12 13:14:13  error      none ckd
[2:9ceadd83]   imputer   2021-01-12 13:14:12  running    none ckd
[3:323f949e]   imputer+  2021-01-12 13:14:12  running    
[4:b9c5ba05]   fo        2021-01-12 13:14:09  running 

The pending run eventually ran and succeeded, and the error went away. But it’s unclear why there was an error in the first place. I took a look at guild view and saw this, which looks strange:

ERROR: 128.97.25.29 - - [12/Jan/2021 05:11:03] code 400, message Bad request syntax ('\t\x00\x00\x00\x01\x00\x00\x00ÿþÿ\x0fQ\x00u\x00a\x00l\x00y\x00s\x00P\x00r\x00o\x00b\x00e\x00T\x00e\x00s\x00t\x00\x02\x00\x00\x00')
ERROR: 128.97.25.29 - - [12/Jan/2021 05:11:03] code 400, message Bad request syntax ('ÿÿÿÿ')
ERROR: 128.97.25.29 - - [12/Jan/2021 05:11:48] code 400, message Bad request syntax ('\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
ERROR: 128.97.25.29 - - [12/Jan/2021 05:12:33] code 400, message Bad request syntax ('gqw7')
ERROR: 127.0.0.1 - - [12/Jan/2021 11:06:18] code 400, message Bad request syntax ('\t\x00\x00\x00\x01\x00\x00\x00ÿþÿ\x0fQ\x00u\x00a\x00l\x00y\x00s\x00P\x00r\x00o\x00b\x00e\x00T\x00e\x00s\x00t\x00\x02\x00\x00\x00')
ERROR: 127.0.0.1 - - [12/Jan/2021 11:06:18] code 400, message Bad request syntax ('ÿÿÿÿ')
ERROR: 127.0.0.1 - - [12/Jan/2021 11:07:03] code 400, message Bad request syntax ('\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
ERROR: 127.0.0.1 - - [12/Jan/2021 11:07:48] code 400, message Bad request syntax ('gqw7')

However, when I do a longer run or try to use queues, they never run. They just show as errors.

Not sure how to debug.

Sorry you’re running into these issues! I suspect the 400s you’re seeing in the View logs are unrelated and represent some other problem.

The status error there is a misnomer - it should read something like unknown. I believe what’s going on is that View and the runs list are catching the run in the middle of initialization. Initialization is typically very fast (milliseconds), so you would rarely catch a run in this state. In your case something is slowing init down a LOT.

Depending on the number of runs and file count, View can take a long time to update its status. So if it catches runs in this intermediate state, they could show that way in View for quite some time. That said, you’re seeing this over and over, so something else is systematically making init MUCH slower than normal.

My first thought with something like this is that your system is in a degraded state, e.g. memory pressure severe enough to trigger swapping, or a disk problem causing high wait times on disk IO.

But if that’s the case you would probably see major issues all over the place. It’s possible that this problem is showing up for Guild runs because they’re hitting sectors on the disk that are causing problems. I don’t give this idea much credence as it’s a bit of a stretch.

The other area, and more likely the culprit, is that there are inter-process locks causing run init to wait on other runs. I need to look at the code to see what could be going on.

In the meantime, stop all running Guild queues and see if this problem keeps popping up. To be certain you’re stopping all queues, check your process list:

On Linux:

ps -ef | grep queue_main

On macOS:

ps aux | grep queue_main
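One wrinkle with either command: the grep process itself matches the pattern and shows up in the output. A small sketch that filters that out (by bracketing the first letter of the pattern) and reports any surviving queue PIDs:

```shell
#!/bin/sh
# Report any surviving queue processes. The [q]ueue_main pattern
# still matches "queue_main" in other processes' command lines, but
# grep's own command line contains "[q]ueue_main", so grep does not
# match itself.
pids=$(ps -ef | grep '[q]ueue_main' | awk '{print $2}')
if [ -n "$pids" ]; then
    echo "queue processes still running: $pids"
else
    echo "no queue processes found"
fi
```

The same trick works with ps aux on macOS; PID is the second column in both output formats, so the awk step is unchanged.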

With the queues out of the way, I suspect the issue will clear up. If not, I think it’s something with the environment.

No problem, thanks so much for the detailed breakdown!

It turns out you were right. It looks like there were some rogue queue processes floating around on my machine. Once I killed them, the problem was fixed!

For background, I have this script:

guild run fo --stage-trials
guild run all-imputers --stage-trials
for _ in `seq 10`; do guild run queue --background -y; done

Then I would monitor my runs (which take a few hours), and once they were done I had another script to kill the queues:

guild stop -o queue
guild runs rm -o queue

I guess something had gone wrong because the queue processes were still alive, so I added the following line to the end of the script: kill $(ps -ef | grep '[q]ueue_main' | awk '{print $2}')
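An alternative sketch that skips the ps parsing entirely: pkill -f matches each process's full command line, so it can target queue_main directly. This assumes (my assumption, not verified against Guild internals) that queue_main appears only in Guild's queue processes:

```shell
#!/bin/sh
# pkill -f matches against each process's full command line, so it
# finds queue_main without collecting PIDs first. pkill exits
# non-zero when nothing matches, so || true keeps a cleanup script
# running under set -e even when all queues are already gone.
pkill -f queue_main || true
echo "queue cleanup complete"
```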

That’s helpful input. Guild could provide a command to look for these rogue processes.

I think this is also a case for a proper scheduler, one with a UI to view and report status. Queues are a nice abstraction, but as you demonstrate, they rely on some higher-level coordination.
