Hi,
I've set up my model with PyTorch Lightning and want to train it on a remote workstation (a single machine with multiple GPUs). The recommended strategy for this is DDP.
I've successfully run the model locally on 1 GPU, remotely on 1 GPU with both the DDP and DP strategies, and remotely on multiple GPUs with the DP strategy.
However, when running the DDP strategy with multiple GPUs on the remote, the process hangs indefinitely on the first GPU initialization. The only indication in the Guild output that something is wrong is this:
```
Installing package and its dependencies
Processing ./gpkg.my_package-0.1-py2.py3-none-any.whl
Installing collected packages: gpkg.my_package
Successfully installed gpkg.my_package-0.1
Starting my_model:train on my_remote as 65f7f7f516f54c78900bcd313a6f906c
[some file resolves]
[some PLT info]
**guild.op_main: missing required arg**
INFO: [pytorch_lightning.utilities.distributed] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
```
The process never progresses past this stage. Running `nvidia-smi` shows no processes on any GPU, and running `htop` shows no corresponding processes on the CPU.
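In case it helps with diagnosis: my (unverified) assumption is that the `ddp` strategy re-launches the training script in child processes using the parent's interpreter and argument vector, so if `argv` points at Guild's `op_main` wrapper rather than at `train.py` itself, the relaunch might be what produces the `missing required arg` message. A minimal, hypothetical debug sketch (not part of my actual script) that could be dropped at the top of the entry point to check this:

```python
# Hypothetical debug sketch: print how this process was launched. Under DDP,
# child processes are (as far as I understand) started again from the parent's
# interpreter and argv, so if argv[0] is a wrapper instead of train.py, the
# re-launch may fail.
import sys

def describe_launch():
    """Summarize the interpreter and argument vector of this process."""
    return f"exe={sys.executable} argv={sys.argv}"

if __name__ == "__main__":
    print(describe_launch())
```

Comparing this output between the parent process and what the DDP children would be launched with should show whether the command line Guild constructs survives the re-launch.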
Below is the training part of the script:
```python
# Define the PyTorch Lightning trainer
trainer = pl.Trainer(
    # auto_scale_batch_size='binsearch',
    auto_lr_find=config.auto_lr,
    fast_dev_run=config.fast_dev_run,
    max_epochs=config.epochs,
    accelerator="gpu",
    strategy='ddp',
    devices=config.gpus,
    precision=16,
    callbacks=[
        # pl.callbacks.StochasticWeightAveraging(swa_lrs=1e-2),
        pl.callbacks.EarlyStopping(monitor='val_auc', mode='max'),
        pl.callbacks.LearningRateMonitor()
    ]
)

# Tune the training parameters
trainer.tune(model)
# Train
trainer.fit(model=model)
# Test
trainer.test(model=model)
```
And the `guild.yml` file:
```yaml
- package: gpkg.my_package
  description: My package
  version: 0.1
  data-files:
    - '../datasets/'
    - 'my_model/models/model_definition.py'
    - 'my_model/train.py'
    - 'my_model/models/loss.py'
    - 'my_model/models/memory.py'
    - 'my_model/config.yaml'
    - './training_utils.py'
    - './model_utils.py'

- model: my_model
  description:
  operations:
    train:
      description: Train my model
      label: "my_model:train - dataset: ${dataset_name}"
      sourcecode:
        - my_model/train.py
        - my_model/models/model_definition.py
        - my_model/models/loss.py
        - my_model/models/memory.py
        - training_utils.py
        - model_utils.py
      requires:
        - config: my_model/config.yaml
        - file: ../datasets/
      main: my_model/train
      flags-dest: config:my_model/config.yaml
      flags-import: all
      flags:
        # Training parameters
        auto_lr: True
        epochs: 2
        fast_dev_run: 100
        gpus: (0,)
        av: False
        # Dataset parameters
        dataset_name: my_dataset
        batch_size: 2
        num_workers: 6
        model_input_size: (256,256)
        data_path: '../../../data/use_case/my_dataset'
        fraction: 1.0
        crop_params: False
        win_len: 16
        # Model parameters
        channels: 3
        mem_dim: 2000
        thresh: 0.0025
        loss_weight: 0.0002
        lr: 10e-5
        wd: 10e-4
```
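One possibly relevant detail: tuple-valued flags such as `gpus: (0,)` and `model_input_size: (256,256)` are plain strings as far as YAML is concerned, so the config-loading side has to turn them back into tuples before handing them to the trainer. A minimal sketch of the kind of parsing I mean (`parse_flag` is a hypothetical helper of mine, not part of Guild's API):

```python
import ast

def parse_flag(value):
    """Turn tuple-like flag strings such as "(0,)" or "(256,256)" into
    tuples; leave every other value unchanged."""
    if isinstance(value, str):
        try:
            # Safely evaluate Python literals (tuples, numbers, ...).
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Not a literal (e.g. a dataset name); keep the raw string.
            return value
    return value

print(parse_flag("(0,)"))        # -> (0,)
print(parse_flag("(256,256)"))   # -> (256, 256)
print(parse_flag("my_dataset"))  # -> my_dataset
```

If the tuple never survives the round-trip through the config file, `devices=config.gpus` would receive a string, though I'd expect that to raise an error rather than hang.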
Perhaps there is a mistake in the package part? It's not entirely clear to me whether this is the correct way to set it up, and perhaps I'm doing something redundant and/or incorrect that prevents Guild from properly initializing my scripts?
Thanks!