Hello,
Sorry for the very basic question. I am trying to run some Python code on a shared cluster, where jobs have to be submitted through a .sh script that generally looks something like this:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8 # Request cores (8 per GPU)
#$ -l h_vmem=7.5G # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0 # Request runtime (up to 240hr max)
#$ -m bea # Send email on begin, end, abort
#$ -l gpu=1 # Request 1 GPU
#$ -N baselineResNet18 # Name for the job
#$ -l gpu_type=volta
# Load the necessary modules
module load python
module load cudnn/8.1.1-cuda11
# Load the virtualenv
source ~/torchgpu/bin/activate
python main.py --arg1 arg1 --arg2 arg2
Now I’ve replaced the last line with
guild run main.py arg1=arg1 arg2=arg2
However, the job dies at the “Continue? (Y/n)” step, since there is no way to interact with a batch job at that point. The log literally looks like this:
Variable OMP_NUM_THREADS has been set to 8
Loading cudnn/8.1.1-cuda11
Loading requirement: cuda/11.0.3
Refreshing flags...
You are about to run main.py
batch_size: 8
data_dir: /data/home/mypath
evaluation_dataset: val
evaluation_epoch: 4
feature_extract: yes
learning_rate: 0.001
mode: train
model_name: resnet
model_output_path: /data/home/myoutputpath
num_classes: 2
num_epochs: 400
save_interval: 4
Continue? (Y/n)
How should I deal with this?
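From skimming the guild CLI help, I suspect the fix is to skip the confirmation prompt with a non-interactive flag — I believe guild run accepts -y/--yes, but I haven’t verified this on the cluster. So the last line of the submission script would become something like:

```shell
# assumption: -y / --yes makes guild skip the "Continue? (Y/n)" prompt,
# so the batch job never blocks waiting for keyboard input
guild run -y main.py arg1=arg1 arg2=arg2
```

Is that the right approach, or is there a recommended way to run guild inside non-interactive batch jobs?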