Hi,
I've set up my model with PyTorch Lightning and want to train it on a remote workstation (a single machine with multiple GPUs). The recommended strategy for this is DDP.
I've successfully run the model locally on 1 GPU, remotely on 1 GPU with both the DDP and DP strategies, and remotely on multiple GPUs with the DP strategy.
However, when running the DDP strategy with multiple GPUs on the remote, the process hangs indefinitely on the first GPU initialization. The only indication in the Guild output that something is wrong is this:
```
Installing package and its dependencies
Processing ./gpkg.my_package-0.1-py2.py3-none-any.whl
Installing collected packages: gpkg.my_package
Successfully installed gpkg.my_package-0.1
Starting my_model:train on my_remote as 65f7f7f516f54c78900bcd313a6f906c
[some file resolves]
[some PLT info]
**guild.op_main: missing required arg**
INFO: [pytorch_lightning.utilities.distributed] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
```
The process never progresses past this stage. Running `nvidia-smi` shows no processes on any GPU, and running `htop` shows no corresponding processes on the CPU.
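In case it helps with diagnosis: my (unverified) assumption is that the `ddp` strategy re-launches the training script in child processes using the parent's interpreter and argument vector, so if `argv` points at Guild's `op_main` wrapper rather than at `train.py` itself, the relaunch might be what produces the `missing required arg` message. A minimal, hypothetical debug sketch (not part of my actual script) that could be dropped at the top of the entry point to check this:

```python
# Hypothetical debug sketch: print how this process was launched. Under DDP,
# child processes are (as far as I understand) started again from the parent's
# interpreter and argv, so if argv[0] is a wrapper instead of train.py, the
# re-launch may fail.
import sys

def describe_launch():
    """Summarize the interpreter and argument vector of this process."""
    return f"exe={sys.executable} argv={sys.argv}"

if __name__ == "__main__":
    print(describe_launch())
```

Comparing this output between the parent process and what the DDP children would be launched with should show whether the command line Guild constructs survives the re-launch.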
Below is the training part of the script:
```python
# Define the PyTorch Lightning trainer
trainer = pl.Trainer(
    # auto_scale_batch_size='binsearch',
    auto_lr_find=config.auto_lr,
    fast_dev_run=config.fast_dev_run,
    max_epochs=config.epochs,
    accelerator="gpu",
    strategy='ddp',
    devices=config.gpus,
    precision=16,
    callbacks=[
        # pl.callbacks.StochasticWeightAveraging(swa_lrs=1e-2),
        pl.callbacks.EarlyStopping(monitor='val_auc', mode='max'),
        pl.callbacks.LearningRateMonitor()
    ]
)

# Tune the training parameters
trainer.tune(model)
# Train
trainer.fit(model=model)
# Test
trainer.test(model=model)
```
And the `guild.yml` file:
```yaml
- package: gpkg.my_package
  description: My package
  version: 0.1
  data-files:
    - '../datasets/'
    - 'my_model/models/model_definition.py'
    - 'my_model/train.py'
    - 'my_model/models/loss.py'
    - 'my_model/models/memory.py'
    - 'my_model/config.yaml'
    - './training_utils.py'
    - './model_utils.py'

- model: my_model
  description:
  operations:
    train:
      description: Train my model
      label: "my_model:train - dataset: ${dataset_name}"
      sourcecode:
        - my_model/train.py
        - my_model/models/model_definition.py
        - my_model/models/loss.py
        - my_model/models/memory.py
        - training_utils.py
        - model_utils.py
      requires:
        - config: my_model/config.yaml
        - file: ../datasets/
      main: my_model/train
      flags-dest: config:my_model/config.yaml
      flags-import: all
      flags:
        # Training parameters
        auto_lr: True
        epochs: 2
        fast_dev_run: 100
        gpus: (0,)
        av: False
        # Dataset parameters
        dataset_name: my_dataset
        batch_size: 2
        num_workers: 6
        model_input_size: (256,256)
        data_path: '../../../data/use_case/my_dataset'
        fraction: 1.0
        crop_params: False
        win_len: 16
        # Model parameters
        channels: 3
        mem_dim: 2000
        thresh: 0.0025
        loss_weight: 0.0002
        lr: 10e-5
        wd: 10e-4
```
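One possibly relevant detail: tuple-valued flags such as `gpus: (0,)` and `model_input_size: (256,256)` are plain strings as far as YAML is concerned, so the config-loading side has to turn them back into tuples before handing them to the trainer. A minimal sketch of the kind of parsing I mean (`parse_flag` is a hypothetical helper of mine, not part of Guild's API):

```python
import ast

def parse_flag(value):
    """Turn tuple-like flag strings such as "(0,)" or "(256,256)" into
    tuples; leave every other value unchanged."""
    if isinstance(value, str):
        try:
            # Safely evaluate Python literals (tuples, numbers, ...).
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Not a literal (e.g. a dataset name); keep the raw string.
            return value
    return value

print(parse_flag("(0,)"))        # -> (0,)
print(parse_flag("(256,256)"))   # -> (256, 256)
print(parse_flag("my_dataset"))  # -> my_dataset
```

If the tuple never survives the round-trip through the config file, `devices=config.gpus` would receive a string, though I'd expect that to raise an error rather than hang.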
Perhaps there is a mistake in the package part? It's not entirely clear to me whether this is the correct way to set it up, and perhaps I'm doing something redundant and/or incorrect that prevents Guild from properly initializing my scripts?
Thanks!