I just ran a test using a long-running operation that looks like this:
import time
time.sleep(1000000)
I start the run this way:
guild run test -r my-remote
I can safely Ctrl-c the session, which disconnects from the remote operation. I can also explicitly kill the underlying ssh command. Either way, the run continues on the remote server. Guild relies on the ssh connection only to start the run, not to keep it running. After the run starts, Guild is technically only “watching” it, which is how it avoids the problem you’re mentioning. The watching is just a log tail. You can kill it without affecting the run itself.
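To illustrate the general idea of a process that outlives the session that launched it, here is a minimal Python sketch. It is not Guild's implementation (I'm only describing the behavior above, not the code). It assumes a POSIX system, and the helper name start_detached is made up for this example.

```python
import subprocess
import sys


def start_detached(seconds):
    """Start a sleeping child process in its own session.

    Hypothetical helper for illustration only; this is not how Guild
    starts remote runs. POSIX only.
    """
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"],
        # Detach the standard streams so the child holds no handle
        # on the launching terminal or connection.
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        # Run the child in a new session (setsid), so a hangup of the
        # launching session is not delivered to it.
        start_new_session=True,
    )


# Example use:
#   proc = start_detached(1000000)
#   print("started", proc.pid)
```

A child started this way keeps running after the launching terminal closes, which is the same kind of behavior you see when you kill the ssh command and the remote run carries on.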
Note that when I run this on a remote, the run does not appear in any local runs list until I explicitly pull the run.
When I view the runs on the remote, I see it running — even after I kill the ssh connection.
guild runs -r my-remote
[1:62af7e9e] gpkg.anonymous-cbedc848/test 2020-11-27 12:27:41 running
When I pull the run, I get the run as it is at the point of the pull. When I list runs locally, I see that it’s running along with the remote name.
guild pull my-remote 62af7e9e
guild runs
[1:62af7e9e] gpkg.anonymous-cbedc848/test 2020-11-27 12:27:41 running (my-remote)
In this case, Guild reflects the status at the time of the pull. Guild does not automatically sync the status in the background (Guild doesn’t use long-running agents unless you explicitly start them). To get the latest from the remote, run pull again.
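If you want to refresh the local copy periodically without typing the command each time, a small script can re-run the same pull command. This is only a sketch: the function names pull_command and repull are made up for this example, it uses the exact command shown above with no extra flags, and it leaves stdin attached so you can answer any prompt the command shows.

```python
import subprocess
import time


def pull_command(remote, run_id):
    """Build the same command shown above: guild pull <remote> <run-id>."""
    return ["guild", "pull", remote, run_id]


def repull(remote, run_id, times, interval, runner=subprocess.run, sleep=time.sleep):
    """Run the pull command `times` times, waiting `interval` seconds between runs.

    `runner` and `sleep` are injectable so the loop can be tried out
    without Guild installed. Returns the list of exit codes.
    """
    codes = []
    for i in range(times):
        result = runner(pull_command(remote, run_id))
        codes.append(result.returncode)
        if i < times - 1:
            sleep(interval)
    return codes


# Example use (assumes guild is on PATH and the remote is configured):
#   repull("my-remote", "62af7e9e", times=3, interval=60)
```

Each pass just repeats what you would do by hand, so the local status is only as fresh as the most recent pull.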
Guild has a sync command that’s convenient for syncing local runs with their remote counterparts. Unfortunately, that command is fubar’d in the 0.7.0 release. That’s fixed for 0.7.1 though. You just run guild sync and any local runs that are still running are synced with the current remote status.
From my end, aside from the broken sync command (which you don’t need anyway), this is working as expected. To help track down the issue, could you identify the stage where it breaks down for you?