Status flag is "terminated" when experiment is still "running"

Hi,

I am currently running experiments, but their status flag says “terminated” instead of “running”. This is quite annoying, since now I can’t delete all terminated runs without also deleting active runs.

Excerpt of output when running guild runs info:

status: terminated
started: 2020-11-25 05:59:38
stopped:

Is this a bug, or did I cause it somehow?

When an operation is stopped by an interrupt, e.g. by pressing Ctrl-c or via SIGINT or SIGKILL (e.g. as used by guild stop), the status is terminated. You can see the exit code in guild runs info.

A terminated run is typically not a problem — there are many reasons you’d want to stop a run.

You can delete terminated runs by filtering with -T (FYI this option is changing to -Ft in 0.7.1). Run this:

guild runs rm -T
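
If you want to preview which runs that filter matches before deleting anything, the same filter option should work with the list command as well (this assumes the 0.7.0 flags, where the terminated filter is -T):

guild runs -T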

I should add that if the operation is in fact still running and shows as terminated, this is a bug.

You can test whether the run process is alive by running kill -0 <pid>, where <pid> is the pid shown in guild runs info.
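
For example, assuming guild runs info reports a pid of 12345, the exit status of kill -0 tells you whether the process still exists:

kill -0 12345 && echo "run process is alive" || echo "run process is gone"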

Hi Garrett, my run seems to be terminated when my ssh session times out. The run will take a day, so I can’t stay on the server for that amount of time.

Any chance you are running your jobs on a remote server or a cluster? In that case, this is normal and not currently supported: Guild grabs the PID for the job status, and it won’t be able to get it if you run on a cluster. See this issue for more.

You might also consider maintaining a session with tmux, which you can install with apt-get install tmux. Use tmux to open a persistent session, tmux ls to list existing sessions, and tmux attach -t [session] to reconnect to a session, as in the sketch below.
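
A minimal workflow might look like this (the session name guild-run and the operation name are just placeholders):

tmux new -s guild-run        # open a persistent session on the server
guild run <operation>        # start the operation inside that session
# detach with Ctrl-b then d; the run keeps going even if your ssh connection drops
tmux ls                      # later, list existing sessions
tmux attach -t guild-run     # reattach to the session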

I run my jobs on a remote cluster. Do you already have a plan for how to fix the status flag for remote jobs? I use guildai quite frequently and would be happy to submit a PR if you give me some hints on how best to attack this issue.

IMO one elegant way to solve this is to write the status to a file (since clusters most likely have a shared file system) and then read it when calling guild compare or guild view; the check should also cover whether Guild terminated properly. But I am not sure if this aligns with the devs’ design patterns.
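
A very rough sketch of that idea (purely hypothetical, this is not how Guild works today; the file name and the $RUN_DIR variable are made up for illustration):

# written periodically by the job running on the cluster node
# ($RUN_DIR stands for the run directory on the shared filesystem)
echo "running $(date)" > "$RUN_DIR/.remote-status"
# read back by the frontend, e.g. when guild compare or guild view is invoked
cat "$RUN_DIR/.remote-status"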

I just ran a test using a long-running operation that looks like this:

import time

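# sleep long enough to simulate a long-running operation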
time.sleep(1000000)

I start the run this way:

guild run test -r my-remote

I can safely Ctrl-c the session, which disconnects from the remote operation. I can also explicitly kill the underlying ssh command. Either way, the run continues on the remote server. Guild only relies on the ssh connection to start the run, not to keep it running. Guild is technically “watching” the run after it starts to avoid the problem you’re mentioning, but the watching is just a log tail; you can kill it without affecting the run itself.

Note that when I run this on a remote, the run does not appear in any local runs list until I explicitly pull the run.

When I view the runs on the remote, I see it running — even after I kill the ssh connection.

guild runs -r my-remote
[1:62af7e9e]  gpkg.anonymous-cbedc848/test  2020-11-27 12:27:41  running  

When I pull the run, I get its state as of the point of the pull. When I list runs locally, I see that it’s running, along with the remote name.

guild pull my-remote 62af7e9e
guild runs
[1:62af7e9e]  gpkg.anonymous-cbedc848/test  2020-11-27 12:27:41  running (my-remote)  

In this case, Guild reflects the status at the time of the pull. Guild does not automatically sync the status in the background (Guild doesn’t use long-running agents unless you explicitly start them). To get the latest from the remote, run pull again.

Guild has a sync command that’s convenient for syncing local runs with their remote counterparts. Unfortunately that command is fubar’d in the 0.7.0 release. That’s fixed for 0.7.1 though. You just run guild sync and any local runs that are still running are synced with the current remote status.

From my end, aside from the broken sync command (which you don’t need anyway), this is working as expected. To help track down the issue, could you identify the stage where it breaks down for you?

Thank you, this works well!

You are right, that works. For me, and I guess a few others as well, the problem is that I submit jobs to a remote cluster that I can’t ssh into, so I can’t pull new updates about the run.
But it’s not that big of a problem, just a bit inconvenient.

How do you access files from that server? Do you use a networked file system, locally mounted?