Hello,
Sorry for the very basic question. I am trying to run some Python code on a shared cluster, where jobs have to be submitted through a .sh script that generally looks something like this:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8 # Request cores (8 per GPU)
#$ -l h_vmem=7.5G # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0 # Request runtime (up to 240hr max)
#$ -m bea # Send email on begin, end, abort
#$ -l gpu=1 # Request 1 GPU
#$ -N baselineResNet18 # Name for the job
#$ -l gpu_type=volta
# Load the necessary modules
module load python
module load cudnn/8.1.1-cuda11
# Load the virtualenv
source ~/torchgpu/bin/activate
python main.py --arg1 arg1 --arg2 arg2
Now I’ve replaced the last line with
guild run main.py arg1=arg1 arg2=arg2
However, the job dies at the “Continue? (Y/n)” step, since there is no way to interact with a batch job at that point. The log literally looks like this:
Variable OMP_NUM_THREADS has been set to 8
Loading cudnn/8.1.1-cuda11
Loading requirement: cuda/11.0.3
Refreshing flags...
You are about to run main.py
batch_size: 8
data_dir: /data/home/mypath
evaluation_dataset: val
evaluation_epoch: 4
feature_extract: yes
learning_rate: 0.001
mode: train
model_name: resnet
model_output_path: /data/home/myoutputpath
num_classes: 2
num_epochs: 400
save_interval: 4
Continue? (Y/n)
How should I deal with this?
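From skimming the guild CLI help, I suspect the fix is to skip the confirmation prompt with a non-interactive flag — I believe guild run accepts -y/--yes, but I haven’t verified this on the cluster. So the last line of the submission script would become something like:

```shell
# assumption: -y / --yes makes guild skip the "Continue? (Y/n)" prompt,
# so the batch job never blocks waiting for keyboard input
guild run -y main.py arg1=arg1 arg2=arg2
```

Is that the right approach, or is there a recommended way to run guild inside non-interactive batch jobs?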