Status flag is "terminated" when experiment is still "running"

Hi,

I am currently running experiments, but their status flag says “terminated” instead of “running”. This is quite annoying, since now I can’t delete all terminated runs without also deleting active runs.

Excerpt of output when running guild runs info:

status: terminated
started: 2020-11-25 05:59:38
stopped:

Is this a bug, or did I cause it somehow?

When an operation is stopped by an interrupt, e.g. by pressing Ctrl-c or via SIGINT or SIGKILL (e.g. as used by guild stop), the status is terminated. You can see the exit code in guild runs info.

A terminated run is typically not a problem — there are many reasons you’d want to stop a run.

You can delete terminated runs by filtering with -T (FYI this option is changing to -Ft in 0.7.1). Run this:

guild runs rm -T
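
If you want to preview which runs that filter matches before deleting anything, the same filter option should work with the list command as well (this assumes the 0.7.0 flags, where the terminated filter is -T):

guild runs -T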

I should add that if the operation is in fact still running and shows as terminated, this is a bug.

You can test whether the run process is alive by running kill -0 <pid>, where <pid> is the pid shown in guild runs info.
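
For example, assuming guild runs info reports a pid of 12345, the exit status of kill -0 tells you whether the process still exists:

kill -0 12345 && echo "run process is alive" || echo "run process is gone"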

Hi Garrett, my run seems to be terminated when my ssh session times out. The run will take a day, so I can’t stay on the server for that amount of time.

Any chance you are running your jobs on a remote server or a cluster? In that case, this is normal and not currently supported: Guild grabs the PID for the job status, and it won’t be able to get it if you run on a cluster. See this issue for more.

You might also consider maintaining a session with tmux, which you can install with apt-get install tmux. Use tmux to open a persistent session, tmux ls to list existing sessions, and tmux attach -t [session] to reconnect to a session, as in the sketch below.
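
A minimal workflow might look like this (the session name guild-run and the operation name are just placeholders):

tmux new -s guild-run        # open a persistent session on the server
guild run <operation>        # start the operation inside that session
# detach with Ctrl-b then d; the run keeps going even if your ssh connection drops
tmux ls                      # later, list existing sessions
tmux attach -t guild-run     # reattach to the session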

I run my jobs on a remote cluster. Do you already have a plan for how to fix the status flag for remote jobs? I use guildai quite frequently and would be happy to submit a PR if you give me some hints on how best to attack this issue.

IMO one elegant way to solve this is to write the status to a file (since clusters most likely have a shared file system) and then read it when calling guild compare or guild view; the check should also cover whether Guild terminated properly. But I am not sure if this aligns with the devs’ design patterns.
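
A very rough sketch of that idea (purely hypothetical, this is not how Guild works today; the file name and the $RUN_DIR variable are made up for illustration):

# written periodically by the job running on the cluster node
# ($RUN_DIR stands for the run directory on the shared filesystem)
echo "running $(date)" > "$RUN_DIR/.remote-status"
# read back by the frontend, e.g. when guild compare or guild view is invoked
cat "$RUN_DIR/.remote-status"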

I just ran a test using a long-running operation that looks like this:

import time

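# sleep long enough to simulate a long-running operation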
time.sleep(1000000)

I start the run this way:

guild run test -r my-remote

I can safely Ctrl-c the session, which disconnects from the remote operation. I can also explicitly kill the underlying ssh command. Either way, the run continues on the remote server. Guild only relies on the ssh connection to start the run, not to keep it running. Guild is technically “watching” the run after it starts to avoid the problem you’re mentioning, but the watching is just a log tail; you can kill it without affecting the run itself.

Note that when I run this on a remote, the run does not appear in any local runs list until I explicitly pull the run.

When I view the runs on the remote, I see it running — even after I kill the ssh connection.

guild runs -r my-remote
[1:62af7e9e]  gpkg.anonymous-cbedc848/test  2020-11-27 12:27:41  running  

When I pull the run, I get its state as of the point of the pull. When I list runs locally, I see that it’s running, along with the remote name.

guild pull my-remote 62af7e9e
guild runs
[1:62af7e9e]  gpkg.anonymous-cbedc848/test  2020-11-27 12:27:41  running (my-remote)  

In this case, Guild reflects the status at the time of the pull. Guild does not automatically sync the status in the background (Guild doesn’t use long-running agents unless you explicitly start them). To get the latest from the remote, run pull again.

Guild has a sync command that’s convenient for syncing local runs with their remote counterparts. Unfortunately that command is fubar’d in the 0.7.0 release. That’s fixed for 0.7.1 though. You just run guild sync and any local runs that are still running are synced with the current remote status.

From my end, aside from the broken sync command (which you don’t need anyway), this is working as expected. To help track down the issue, could you identify the stage where it breaks down for you?

Thank you, this works well!

You are right, that works. For me, and I guess a few others as well, the problem is that I submit jobs to a remote cluster that I can’t ssh into, so I can’t pull new updates about the run.
But it’s not that big of a problem, just a bit inconvenient.

How do you access files from that server? Do you use a networked file system, locally mounted?