Remote connection error with jump host

Hi,
my institution has recently changed the configuration of our remote workstations.

Now the connection goes through a jump host, where we cannot use an SSH key pair. I have a proxy configured, so I connect to the workstation manually with ‘ssh [workstation]’. The jump host requires a password on every connection, followed by an app authentication. The workstation has an SSH key pairing with my local machine, so I only have to log in to the jump host. That’s the policy, and it cannot be changed.

I have successfully managed to run a guild check on that remote. I have configured a training script, config files, etc. so that it all runs smoothly locally.

However, when I try to run the train operation on the remote, I get the following errors:

Initializing remote run
Password: 
Copying package
Password: 
Connection timed out during banner exchange
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
Traceback (most recent call last):
  File "/home/bleporowski/anaconda3/envs/marvel/bin/guild", line 8, in <module>
    sys.exit(main())
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/main_bootstrap.py", line 40, in main
    _main()
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/main_bootstrap.py", line 66, in _main
    guild.main.main()
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/main.py", line 33, in main
    main_cmd.main(standalone_mode=False)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/click_util.py", line 213, in fn
    return fn0(*(args + (Args(**kw),)))
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run.py", line 649, in run
    run_impl.main(args)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 1514, in main
    _dispatch_op(S)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 1610, in _dispatch_op
    _dispatch_op_cmd(S)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 1797, in _dispatch_op_cmd
    _confirm_and_run(S)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 1874, in _confirm_and_run
    _run(S)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 2075, in _run
    _run_remote(S)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/run_impl.py", line 2082, in _run_remote
    remote_impl_support.run(_remote_args(S))
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/commands/remote_impl_support.py", line 125, in run
    run_id = remote.run_op(**_run_kw(args))
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/remotes/ssh.py", line 243, in run_op
    remote_run_dir = self._init_remote_run(tmp.path, opspec, restart)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/remotes/ssh.py", line 265, in _init_remote_run
    self._copy_package_dist(package_dist_dir, remote_run_dir)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/remotes/ssh.py", line 330, in _copy_package_dist
    ssh_util.rsync_copy_to(
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/site-packages/guild/remotes/ssh_util.py", line 129, in rsync_copy_to
    subprocess.check_call(cmd)
  File "/home/bleporowski/anaconda3/envs/marvel/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['rsync', '-vr', '-e', "ssh -oConnectTimeout=10 -o 'ProxyCommand ssh -oConnectTimeout=100  -W %h:%p [user]@[jumphost]'", '/tmp/guild-remote-stage-ahx9az7p/', '[user]@[workstation]:~/anaconda3/envs/time-gop/.guild/runs/5d5d24d410c648f897630ef102538a1e/.guild/job-packages/']' returned non-zero exit status 255.

I’m curious about two things:

  • Why do I have to log in twice, once after the ‘Initializing remote run’ log line, and then again after the ‘Copying package’ log line?
  • I have set up my remotes in the guild/config.yml to have a timeout of 100 seconds for both the jump host and the second step connection. However, from the trace it seems that the guild/config.yml timeout is not properly read?

This is the guild/config.yml:

remotes:
 [remote-name]:
  type: ssh
  host: [workstation]
  proxy: ssh -oConnectTimeout=100 -W %h:%p [user]@[jump host]
  connect-time: 100
  user: [user]
  conda-env: ~/anaconda3/envs/time-gop  
  init: source ~/anaconda3/etc/profile.d/conda.sh | guild -H ~/projects/protime-gop

The obvious reason would be that the connection times out, as per the error log. However, the config timeout value doesn’t seem to change the value actually passed to the remote command.

Have I made a mistake while creating my guild/config.yml? Or is it a bug? Or maybe there is some other reason?

Can you try running rsync from your local system to the jump host? We should replicate the problem outside of Guild.

You’re being asked for password auth multiple times because Guild can establish multiple ssh connections over the course of a run command (e.g. ssh, rsync, ssh, etc.). If you can cache the auth session for the jump host to avoid the multiple challenges, that would be ideal.
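One possible way to cache the jump-host session is OpenSSH connection multiplexing. The sketch below assumes the jump host permits multiplexed sessions, which has not been confirmed in this thread. ControlMaster, ControlPath and ControlPersist are standard OpenSSH client options; the user, host name, socket path and the DRY_RUN switch are placeholders introduced for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: authenticate to the jump host once and keep a reusable master
# connection open. All names below are placeholders, not values from the thread.
JUMP="${JUMP:-user@jumphost}"
CONTROL_PATH="${CONTROL_PATH:-$HOME/.ssh/cm-%r@%h-%p}"
MUX_OPTS="-o ControlMaster=auto -o ControlPath=$CONTROL_PATH -o ControlPersist=10m"

# Print the command; execute it as well only when DRY_RUN=0.
run() {
  printf '%s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# 1. Log in once (password + app) and leave the master in the background.
#    -f: background after authentication, -N: no remote command.
run ssh -fN $MUX_OPTS "$JUMP"

# 2. Ask the master connection whether it is still alive.
run ssh -O check -o ControlPath="$CONTROL_PATH" "$JUMP"
```

Guild's proxy command would only reuse this connection if the same ControlPath applies to the jump host when it runs, for example through a matching entry in ~/.ssh/config. That part is untested here.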

Connecting with rsync to the workstation, with the following command, works without errors:

rsync guild.yml [user]@[workstation]:

and the file is uploaded to the workstation.

Interestingly, while ssh and guild require logging in to the jump host, rsync seems to bypass that somehow, and I can copy the file from my local machine to the workstation without a password prompt or authorization. I have very limited knowledge in this area, but I think it’s a bit weird.

Running the same command with the jumphost:

rsync guild.yml [user]@[jump host]:

gives no errors, and again I do not have to login. However, this time the file is not copied to the jump host.

Edit:
I believe that the issue is really this part of the command that guild sends:

subprocess.CalledProcessError: Command '['rsync', '-vr', '-e', "ssh -oConnectTimeout=10

It seems to me that with the multiple authorizations that guild requires, those 10 seconds are too short. But as I mentioned before, setting connect-time in the guild config doesn’t seem to influence the ‘oConnectTimeout’ parameter.
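To test that suspicion outside of Guild, the rsync call from the traceback can be rebuilt by hand with an adjustable outer timeout. This is only a sketch: the command shape is copied from the error message above, while the TIMEOUT, JUMP, TARGET, SRC and DRY_RUN variables are placeholders introduced here. By default it prints the command without running it.

```shell
#!/bin/sh
# Sketch: reproduce the rsync call from the traceback with a configurable
# ConnectTimeout. Host names and user are placeholders.
TIMEOUT="${TIMEOUT:-100}"
JUMP="${JUMP:-user@jumphost}"
TARGET="${TARGET:-user@workstation}"
SRC="${SRC:-./guild.yml}"   # any small local file

# Same shape as the "-e" argument in the traceback; only the outer
# timeout (10 in the failing command) is made adjustable.
RSH="ssh -oConnectTimeout=$TIMEOUT -o 'ProxyCommand ssh -oConnectTimeout=100 -W %h:%p $JUMP'"

printf '%s\n' "rsync -vr -e \"$RSH\" $SRC $TARGET:"

# Set DRY_RUN=0 to actually run the transfer.
if [ "${DRY_RUN:-1}" = "0" ]; then
  rsync -vr -e "$RSH" "$SRC" "$TARGET:"
fi
```

Running it once with TIMEOUT=10 and once with TIMEOUT=100 should show whether the 10-second limit alone explains the "banner exchange" timeout.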

Are you seeing a delay of 10 seconds or more when you run rsync manually? This would be an easy fix if it’s just a matter of bumping up the timeout value.

When using rsync manually, no. Probably because it somehow caches or avoids the authorization, so there is no real possibility for delay.

When running on the remote from Guild, every time I’m prompted for the password I have to wait until the authorization app fires, and then authorize on my phone. The error occurs every time on the second password prompt, after ‘Copying package’, hence my suspicion that the 10 second timeout here is to blame.

Ah that makes sense. I’d be curious if you can get the run to complete by bumping this timeout to at least let you auth and get past your current blocker.

The real solution is to fix the auth problem. This is likely an environment difference between the shell you’re using to run the commands manually and the OS process that Guild is using to run the same commands. It’d be nice to know how the system is caching the auth state from your shell.
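A rough way to see what the interactive shell is doing differently is to print the settings ssh resolves for the host, plus the agent socket visible to the current process. `ssh -G` is a standard OpenSSH option that prints the resolved client configuration without connecting. The host alias and the choice of options to filter are assumptions made for this sketch.

```shell
#!/bin/sh
# Sketch: show client-side settings that affect proxying and connection reuse.
HOST_ALIAS="${HOST_ALIAS:-workstation}"   # placeholder

# Keep only the options relevant to jump hosts and cached connections.
filter_conn_opts() {
  grep -Ei '^(proxyjump|proxycommand|controlmaster|controlpath|controlpersist|identityfile) '
}

# "ssh -G" prints the resolved configuration and exits; no connection is made.
if command -v ssh >/dev/null 2>&1; then
  ssh -G "$HOST_ALIAS" 2>/dev/null | filter_conn_opts || true
fi

# An agent socket present in this shell but absent from Guild's process
# would be one environment difference worth checking.
echo "SSH_AUTH_SOCK=${SSH_AUTH_SOCK:-<unset>}"
```

Comparing the output from the shell where manual rsync works against the environment Guild is launched from may narrow down where the auth state is cached.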

For the temporary workaround (getting a higher timeout), you can set the connect-timeout attribute of the remote to something like 60 or higher, for example:

remotes:
  your-remote:
    connect-timeout: 100

One problem though - there’s a bug in the current release where that value isn’t passed along to one of the rsync calls. I just committed a fix for this to the master branch in GitHub. If you run Guild from source, you’ll get this fix.
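Note that the config posted earlier in the thread uses the key connect-time, while the attribute named here is connect-timeout. A small check like the following can confirm which key a config file contains. It is a sketch only: the default path ~/.guild/config.yml is an assumption about where the user config lives, and check_timeout_key is a helper made up for this example.

```shell
#!/bin/sh
# Sketch: report which timeout key appears in a Guild user config file.
check_timeout_key() {
  if grep -Eq '^[[:space:]]*connect-timeout:' "$1"; then
    echo "connect-timeout found"
  elif grep -Eq '^[[:space:]]*connect-time:' "$1"; then
    echo "connect-time found (expected connect-timeout)"
  else
    echo "no timeout key found"
  fi
}

CONFIG="${CONFIG:-$HOME/.guild/config.yml}"   # assumed location
if [ -f "$CONFIG" ]; then
  check_timeout_key "$CONFIG"
fi
```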