Guild interfering with distributed resource allocation (pytorch-lightning/ray-tune)

I am using both pytorch-lightning (2.0.3) and ray-tune (2.4.0) on Python 3.9.15 for tuning an ML model.
The tuning procedure can be found here.

Essentially I am using the new Ray LightningTrainer interface alongside the Tuner. The LightningTrainer takes a ScalingConfig, which allocates resources for training, and I additionally use the ASHAScheduler to schedule trials. With just this setup I run into the problem below, but it is also broken if I wrap the ASHAScheduler with a ResourceChangingScheduler, as you’ll see in the code linked above.

If I run the exact same settings but launch autopopulus/impute.py with Python directly, I have no problems: all my resources are correctly managed and training/tuning runs fine. However, if I run through Guild, I get this warning:

(RayTrainWorker pid=2355) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

Essentially, on a system with 32 CPUs and 4 GPUs, I should be able to create 4 processes with 1 GPU and 7 CPUs each (32/4 = 8, minus 1 CPU for coordination).

However, the system looks resource-poor for some reason, reporting that I only have 2 workers available instead of 7. Because these resources are somehow inaccessible, training "runs" but nothing actually happens: it cannot move forward with the requested resources. I want to be able to run all 4 trials in parallel.
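
For reference, here is a minimal sketch of the layout I am aiming for. It mirrors the MWE below, except with one Ray worker per trial and four concurrent trials; the exact ScalingConfig/TuneConfig values here are only my guess at how to express that intent:

from ray import tune
from ray.air.config import ScalingConfig

# Intended layout on a 32-CPU / 4-GPU box: four concurrent trials, each with
# one Ray worker using 7 CPUs + 1 GPU, leaving roughly 1 CPU per trial for
# coordination (4 * (7 + 1) = 32 CPUs, 4 GPUs).
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    resources_per_worker={"CPU": 7, "GPU": 1},
)
tune_config = tune.TuneConfig(num_samples=4, max_concurrent_trials=4)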

This is a very pressing issue for me as I’m on a tight deadline.

Here is an MWE:
mwe.py:

"""MWE following https://docs.ray.io/en/latest/tune/examples/tune-pytorch-lightning.html"""
import os
from typing import Any
from numpy import array, ndarray
from numpy.random import default_rng
from pandas import DataFrame, Index, Series
from pytorch_lightning import LightningDataModule, LightningModule
from pytorch_lightning.loggers import TensorBoardLogger
import torch
import torch.nn as nn
from torch import from_numpy
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
from ray import tune, air
from ray.tune.schedulers import ASHAScheduler
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from ray.train.lightning import LightningTrainer, LightningConfigBuilder

SEED = 239
NUM_TRAIN_SAMPLES = 80
NUM_VAL_SAMPLES = 20
NUM_IN_FEATURES = 10


class MyLightningModule(LightningModule):
    def __init__(
        self,
        lr: float,
        arg_with_index: Index,
        arg_with_Series: Series,
        arg_with_dataframe: DataFrame,
        arg_with_ndarray: ndarray,
    ):
        super().__init__()
        self.save_hyperparameters()
        self.loss = nn.MSELoss()
        self.fc1 = nn.Linear(NUM_IN_FEATURES, 5)
        self.fc2 = nn.Linear(5, NUM_IN_FEATURES)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def training_step(self, train_batch, batch_idx):
        return self.shared_step("train", train_batch, batch_idx)

    def validation_step(self, train_batch, batch_idx):
        return self.shared_step("val", train_batch, batch_idx)

    def shared_step(self, split, batch, batch_idx):
        x, y = batch
        output = self(x)
        loss = self.loss(output, y)
        self.log(f"{split}/loss", loss)
        return loss

    def configure_optimizers(self) -> Any:
        return Adam(self.parameters(), lr=self.hparams.lr)


class MyDataset(Dataset):
    def __init__(self, X, y) -> None:
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index) -> Any:
        return from_numpy(self.X[index]).to(torch.float), from_numpy(self.y[index]).to(
            torch.float
        )


class MyDataModule(LightningDataModule):
    def setup(self, stage=None):
        rng = default_rng(SEED)
        self.train = rng.random((NUM_TRAIN_SAMPLES, NUM_IN_FEATURES))
        self.train_true = rng.random((NUM_TRAIN_SAMPLES, NUM_IN_FEATURES))

        self.val = rng.random((NUM_VAL_SAMPLES, NUM_IN_FEATURES))
        self.val_true = rng.random((NUM_VAL_SAMPLES, NUM_IN_FEATURES))

        self.test = rng.random((NUM_VAL_SAMPLES, NUM_IN_FEATURES))
        self.test_true = rng.random((NUM_VAL_SAMPLES, NUM_IN_FEATURES))

    def train_dataloader(self):
        return DataLoader(
            MyDataset(self.train, self.train_true), batch_size=10, num_workers=7
        )

    def val_dataloader(self):
        return DataLoader(
            MyDataset(self.val, self.val_true), batch_size=10, num_workers=7
        )

    def test_dataloader(self):
        return DataLoader(
            MyDataset(self.test, self.test_true), batch_size=10, num_workers=7
        )


if __name__ == "__main__":
    num_epochs = 5
    num_tuning_samples = 2
    accelerator = "gpu"
    dm = MyDataModule()
    logger = TensorBoardLogger(save_dir=os.getcwd(), name="mwe")

    # Define constant configs and searchable configs separately
    config = {
        "arg_with_index": Index([1, 2, 3, 4]),
        "arg_with_Series": Series(
            [1, 2, 3, 4], index=["a", "b", "c", "d"], name="name"
        ),
        "arg_with_dataframe": DataFrame(
            [[1, 2, 3, 4], [5, 6, 7, 8]],
            index=["a", "b"],
            columns=["col1", "col2", "col3", "col4"],
        ),
        "arg_with_ndarray": array([1, 2, 3, 4]),
    }

    search_config = {
        "lr": tune.loguniform(1e-4, 1e-1),
    }

    lightning_config = (
        LightningConfigBuilder()
        .module(cls=MyLightningModule, **config)
        .trainer(max_epochs=num_epochs, accelerator=accelerator, logger=logger)
        .fit_params(datamodule=dm)
        .checkpointing(monitor="val/loss", save_top_k=1, mode="min")
        .build()
    )

    run_config = RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="val/loss",
            checkpoint_score_order="min",
        ),
    )
    scaling_config = ScalingConfig(
        num_workers=4, use_gpu=True, resources_per_worker={"CPU": 7, "GPU": 1}
    )

    search_lightning_config = LightningConfigBuilder().module(**search_config).build()

    # Initialize your trainer with constant configs first
    lightning_trainer = LightningTrainer(
        lightning_config=lightning_config,
        scaling_config=scaling_config,
        run_config=run_config,
    )

    # Then, feed the search space to Tuner
    tuner = tune.Tuner(
        lightning_trainer,
        param_space={"lightning_config": search_lightning_config},
        tune_config=tune.TuneConfig(
            metric="val/loss",
            mode="min",
            num_samples=num_tuning_samples,
            scheduler=ASHAScheduler(
                max_t=num_epochs,
                grace_period=1,
                reduction_factor=2,
            ),
        ),
        run_config=air.RunConfig(
            name="mwe",
        ),
    )
    results = tuner.fit()

guild.yml:

- operations:
    mwe:
      description: Guild interfering with resource allocation in pytorch-lightning and ray-tune.
      env: # don't save pycache
        PYTHONDONTWRITEBYTECODE: 1
      sourcecode: no
      output-scalars: off
      main: mwe

Then run guild run mwe.

You will see the same message I pasted above:

(RayTrainWorker pid=13071) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

The training will run, however, because the number of training samples is relatively low. If you increase it to something like 80,000 you will see how slow it becomes.

We’re taking a look at this now. We may need to test this on a system with multiple GPUs to reproduce.

Let me know if there’s anything I can do to help.

We created an issue resolution doc here:

The current status is that we can’t recreate this behavior on a particular test system. The project there details the situation.

If you could review that document and try to use that project to recreate the issue, that would be very helpful! Please run the project operation when you try to recreate it. Modify anything you like in that project as needed. If you are able to create a run that shows the problem, you can add it to the project so we can see the details.

E.g. from the my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune directory, run:

guild publish <run that shows the issue>

This will add the run to published-runs. Then just submit a PR for the project and we’ll take a look!

Hi! Thanks for getting back to me. I recreated it and did some more testing, and I was still getting the warning after some fiddling.

I published the run and put a PR through.

Thanks for the published run!

Could you attach here the output you get when you run without Guild for the same flag values? (You’ll need to modify the source code.) I.e.:

python mwe.py  # with workers=1 CPUs=7 GPUs=1

It would be helpful to see how Guild is causing the script to behave differently.

If that doesn’t reproduce the problem, could you run with the applicable flag values (workers, CPUs, GPUs) that show the problem with Guild?

As an aside, it’s not clear to me how Guild in this case is reflecting degraded/missing resources. Here’s some sample output from the run you published:

That shows 8.0/32 CPUs, 1.0/4 GPUs, which I think is what you’d expect with 1 worker, 7 CPUs and 1 GPU.

What’s odd is that the output later shows 16/32 CPUs:

That count continues until near the end. Finally the count (last status entry) is 0/32 CPUs:

I’m not sure what’s going on here — perhaps this is to be expected.

Here’s the run output of python mwe.py with tune workers = 1, dataloader workers = 7, CPUs = 7, GPUs = 1.

Run Output
2023-07-05 19:40:12,958	INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-05 19:40:22,037	INFO tune.py:218 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
/home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/ray/tune/experiment/experiment.py:170: UserWarning: The `local_dir` argument of `Experiment is deprecated. Use `storage_path` or set the `TUNE_RESULT_DIR` environment variable instead.
  warnings.warn(
(RayTrainWorker pid=28498) 2023-07-05 19:40:27,230	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=28498) GPU available: True (cuda), used: True
(RayTrainWorker pid=28498) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=28498) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=28498) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28498) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
(RayTrainWorker pid=28498) 
(RayTrainWorker pid=28498)   | Name | Type    | Params
(RayTrainWorker pid=28498) ---------------------------------
(RayTrainWorker pid=28498) 0 | loss | MSELoss | 0     
(RayTrainWorker pid=28498) 1 | fc1  | Linear  | 55    
(RayTrainWorker pid=28498) 2 | fc2  | Linear  | 60    
(RayTrainWorker pid=28498) ---------------------------------
(RayTrainWorker pid=28498) 115       Trainable params
(RayTrainWorker pid=28498) 0         Non-trainable params
(RayTrainWorker pid=28498) 115       Total params
(RayTrainWorker pid=28498) 0.000     Total estimated model params size (MB)
(RayTrainWorker pid=28498) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=28498)   warnings.warn(_create_warning_msg(
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28498) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=28498)   warning_cache.warn(
(RayTrainWorker pid=28754) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(RayTrainWorker pid=28754) 
(RayTrainWorker pid=28754)   | Name | Type    | Params
(RayTrainWorker pid=28754) ---------------------------------
(RayTrainWorker pid=28754) 0 | loss | MSELoss | 0     
(RayTrainWorker pid=28754) 1 | fc1  | Linear  | 55    
(RayTrainWorker pid=28754) 2 | fc2  | Linear  | 60    
(RayTrainWorker pid=28754) ---------------------------------
(RayTrainWorker pid=28754) 115       Trainable params
(RayTrainWorker pid=28754) 0         Non-trainable params
(RayTrainWorker pid=28754) 115       Total params
(RayTrainWorker pid=28754) 0.000     Total estimated model params size (MB)
(RayTrainWorker pid=28754) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=28754)   warnings.warn(_create_warning_msg(
(RayTrainWorker pid=28754)   warning_cache.warn(
(RayTrainWorker pid=28754)   warning_cache.warn(
== Status ==
Current time: 2023-07-05 19:40:24 (running for 00:00:02.58)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 8.0/32 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 PENDING, 1 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |
| LightningTrainer_740a3_00001 | PENDING  |                      |            0.000597133 |
+------------------------------+----------+----------------------+------------------------+


== Status ==
Current time: 2023-07-05 19:40:29 (running for 00:00:07.60)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |
+------------------------------+----------+----------------------+------------------------+


== Status ==
Current time: 2023-07-05 19:40:34 (running for 00:00:12.60)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |
+------------------------------+----------+----------------------+------------------------+


== Status ==
Current time: 2023-07-05 19:40:39 (running for 00:00:17.61)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |
+------------------------------+----------+----------------------+------------------------+


== Status ==
Current time: 2023-07-05 19:40:44 (running for 00:00:22.61)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |
+------------------------------+----------+----------------------+------------------------+


Result for LightningTrainer_740a3_00000:
  _report_on: train_epoch_end
  date: 2023-07-05_19-40-48
  done: false
  epoch: 0
  hostname: lambda2
  iterations_since_restore: 1
  node_ip: 131.179.80.122
  pid: 28144
  should_checkpoint: true
  step: 800
  time_since_restore: 23.41086745262146
  time_this_iter_s: 23.41086745262146
  time_total_s: 23.41086745262146
  timestamp: 1688611247
  train/loss: 0.13250067830085754
  training_iteration: 1
  trial_id: 740a3_00000
  val/loss: 0.11762388795614243
  
== Status ==
Current time: 2023-07-05 19:40:53 (running for 00:00:31.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.11762388795614243
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.11762388795614243 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |      1 |          23.4109 |     0.132501 |   0.117624 |       0 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |        |                  |              |            |         |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


Result for LightningTrainer_740a3_00001:
  _report_on: train_epoch_end
  date: 2023-07-05_19-40-53
  done: false
  epoch: 0
  hostname: lambda2
  iterations_since_restore: 1
  node_ip: 131.179.80.122
  pid: 28500
  should_checkpoint: true
  step: 800
  time_since_restore: 25.342666387557983
  time_this_iter_s: 25.342666387557983
  time_total_s: 25.342666387557983
  timestamp: 1688611253
  train/loss: 0.10083736479282379
  training_iteration: 1
  trial_id: 740a3_00001
  val/loss: 0.1018548458814621
  
Result for LightningTrainer_740a3_00000:
  _report_on: train_epoch_end
  date: 2023-07-05_19-40-57
  done: false
  epoch: 1
  hostname: lambda2
  iterations_since_restore: 2
  node_ip: 131.179.80.122
  pid: 28144
  should_checkpoint: true
  step: 1600
  time_since_restore: 32.800697326660156
  time_this_iter_s: 9.389829874038696
  time_total_s: 32.800697326660156
  timestamp: 1688611257
  train/loss: 0.08113399147987366
  training_iteration: 2
  trial_id: 740a3_00000
  val/loss: 0.10029216855764389
  
== Status ==
Current time: 2023-07-05 19:41:02 (running for 00:00:40.40)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.10029216855764389 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.10029216855764389 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |      2 |          32.8007 |     0.081134 |   0.100292 |       1 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |      1 |          25.3427 |     0.100837 |   0.101855 |       0 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


Result for LightningTrainer_740a3_00001:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-03
  done: false
  epoch: 1
  hostname: lambda2
  iterations_since_restore: 2
  node_ip: 131.179.80.122
  pid: 28500
  should_checkpoint: true
  step: 1600
  time_since_restore: 34.708067655563354
  time_this_iter_s: 9.365401268005371
  time_total_s: 34.708067655563354
  timestamp: 1688611263
  train/loss: 0.07746944576501846
  training_iteration: 2
  trial_id: 740a3_00001
  val/loss: 0.09756060689687729
  
Result for LightningTrainer_740a3_00000:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-07
  done: false
  epoch: 2
  hostname: lambda2
  iterations_since_restore: 3
  node_ip: 131.179.80.122
  pid: 28144
  should_checkpoint: true
  step: 2400
  time_since_restore: 42.57479667663574
  time_this_iter_s: 9.774099349975586
  time_total_s: 42.57479667663574
  timestamp: 1688611267
  train/loss: 0.09988719969987869
  training_iteration: 3
  trial_id: 740a3_00000
  val/loss: 0.0968434065580368
  
== Status ==
Current time: 2023-07-05 19:41:12 (running for 00:00:50.18)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.0968434065580368 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |      3 |          42.5748 |    0.0998872 |  0.0968434 |       2 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |      2 |          34.7081 |    0.0774694 |  0.0975606 |       1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


Result for LightningTrainer_740a3_00001:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-12
  done: false
  epoch: 2
  hostname: lambda2
  iterations_since_restore: 3
  node_ip: 131.179.80.122
  pid: 28500
  should_checkpoint: true
  step: 2400
  time_since_restore: 44.249186754226685
  time_this_iter_s: 9.54111909866333
  time_total_s: 44.249186754226685
  timestamp: 1688611272
  train/loss: 0.08657821267843246
  training_iteration: 3
  trial_id: 740a3_00001
  val/loss: 0.09475232660770416
  
Result for LightningTrainer_740a3_00000:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-16
  done: false
  epoch: 3
  hostname: lambda2
  iterations_since_restore: 4
  node_ip: 131.179.80.122
  pid: 28144
  should_checkpoint: true
  step: 3200
  time_since_restore: 52.18841528892517
  time_this_iter_s: 9.613618612289429
  time_total_s: 52.18841528892517
  timestamp: 1688611276
  train/loss: 0.09277255833148956
  training_iteration: 4
  trial_id: 740a3_00000
  val/loss: 0.09569665789604187
  
(RayTrainWorker pid=28498) `Trainer.fit` stopped: `max_epochs=5` reached.
2023-07-05 19:41:29,815	INFO tune.py:945 -- Total run time: 67.78 seconds (67.74 seconds for the tuning loop).
== Status ==
Current time: 2023-07-05 19:41:21 (running for 00:00:59.78)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.09569665789604187 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09475232660770416 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING  | 131.179.80.122:28144 |            0.000236706 |      4 |          52.1884 |    0.0927726 |  0.0956967 |       3 |
| LightningTrainer_740a3_00001 | RUNNING  | 131.179.80.122:28500 |            0.000597133 |      3 |          44.2492 |    0.0865782 |  0.0947523 |       2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


Result for LightningTrainer_740a3_00001:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-22
  done: false
  epoch: 3
  hostname: lambda2
  iterations_since_restore: 4
  node_ip: 131.179.80.122
  pid: 28500
  should_checkpoint: true
  step: 3200
  time_since_restore: 53.791813373565674
  time_this_iter_s: 9.54262661933899
  time_total_s: 53.791813373565674
  timestamp: 1688611282
  train/loss: 0.09039577096700668
  training_iteration: 4
  trial_id: 740a3_00001
  val/loss: 0.09388325363397598
  
Result for LightningTrainer_740a3_00000:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-26
  done: true
  epoch: 4
  hostname: lambda2
  iterations_since_restore: 5
  node_ip: 131.179.80.122
  pid: 28144
  should_checkpoint: true
  step: 4000
  time_since_restore: 61.73058104515076
  time_this_iter_s: 9.542165756225586
  time_total_s: 61.73058104515076
  timestamp: 1688611286
  train/loss: 0.09823152422904968
  training_iteration: 5
  trial_id: 740a3_00000
  val/loss: 0.09352872520685196
  
Trial LightningTrainer_740a3_00000 completed.
Result for LightningTrainer_740a3_00001:
  _report_on: train_epoch_end
  date: 2023-07-05_19-41-29
  done: true
  epoch: 4
  hostname: lambda2
  iterations_since_restore: 5
  node_ip: 131.179.80.122
  pid: 28500
  should_checkpoint: true
  step: 4000
  time_since_restore: 61.19364666938782
  time_this_iter_s: 7.4018332958221436
  time_total_s: 61.19364666938782
  timestamp: 1688611289
  train/loss: 0.093507781624794
  training_iteration: 5
  trial_id: 740a3_00001
  val/loss: 0.09201841056346893
  
Trial LightningTrainer_740a3_00001 completed.
== Status ==
Current time: 2023-07-05 19:41:29 (running for 00:01:07.73)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09478995576500893 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 8.0/32 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09201841056346893 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 RUNNING, 1 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status     | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |            |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00001 | RUNNING    | 131.179.80.122:28500 |            0.000597133 |      5 |          61.1936 |    0.0935078 |  0.0920184 |       4 |
| LightningTrainer_740a3_00000 | TERMINATED | 131.179.80.122:28144 |            0.000236706 |      5 |          61.7306 |    0.0982315 |  0.0935287 |       4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


== Status ==
Current time: 2023-07-05 19:41:29 (running for 00:01:07.75)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09478995576500893 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09201841056346893 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status     | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |            |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | TERMINATED | 131.179.80.122:28144 |            0.000236706 |      5 |          61.7306 |    0.0982315 |  0.0935287 |       4 |
| LightningTrainer_740a3_00001 | TERMINATED | 131.179.80.122:28500 |            0.000597133 |      5 |          61.1936 |    0.0935078 |  0.0920184 |       4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+


(RayTrainWorker pid=28754) `Trainer.fit` stopped: `max_epochs=5` reached.

This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
So, strangely enough, it looks like I am still getting this warning, so it’s probably not Guild’s fault? What’s confusing to me is that I have the same tuning setup in my full project, and when I run it with Python I don’t get that warning, but when I run it through Guild I do (which is why I thought it was a symptom of Guild).
However, with the MWE I get the warning either way.
I will try to dig for any differences that may be causing this discrepancy in behavior.

The resource missingness I was concerned about revolves less around the reported resource usage and more around that DataLoader warning. In particular, it’s strange that with only 8/32 CPUs in use I should surely have room for 7 DataLoader workers, yet we see that warning; it appears even when more resources are being used/requested.
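
As far as I can tell, the "suggested max number of worker" in that message is not computed from Ray’s logical CPU accounting at all: recent torch versions appear to derive it from the CPU affinity of the current process (falling back to os.cpu_count()), so a suggestion of 2 would mean the RayTrainWorker process itself only sees 2 usable cores. A tiny diagnostic sketch along those lines (the helper name and where to call it are just illustrative):

import os

def visible_cpus() -> int:
    """Roughly what torch's DataLoader warning appears to be based on:
    the process's CPU affinity when available, else os.cpu_count()."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

# e.g. print(visible_cpus()) at the top of MyDataModule.train_dataloader()
# to see what each RayTrainWorker process is actually given, and size the
# loader from it instead of hardcoding 7:
#     num_workers = max(1, visible_cpus() - 1)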

I believe the progression of resources you’re seeing there is intended behavior from Ray. Essentially, each trial is assigned a group of num_workers (here 1) workers with the requested resources (7 CPUs, 1 GPU). Ray picks up the first trial and assigns it 1 GPU and 7 CPUs, plus 1 CPU for a head process that manages everything. It then sees there are enough resources to run the second trial in parallel, so it devotes another 1 GPU + 8 CPUs, resulting in 16/32 and 2/4. When tuning has ended and all trials are finished, Ray doesn’t need any resources anymore, hence 0/32 and 0/4.
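
In other words, a rough sketch of the accounting (assuming Ray reserves 1 CPU per trial for the trainable/coordinator, which I believe is the default):

# Rough sketch of Ray Tune's logical resource accounting per LightningTrainer
# trial, assuming a 1-CPU coordinator reservation per trial (my assumption).
def logical_usage(num_trials, num_workers, cpus_per_worker, gpus_per_worker,
                  coordinator_cpus=1):
    cpus = num_trials * (num_workers * cpus_per_worker + coordinator_cpus)
    gpus = num_trials * num_workers * gpus_per_worker
    return cpus, gpus

print(logical_usage(1, 1, 7, 1))  # (8, 1)   ->  8.0/32 CPUs, 1.0/4 GPUs
print(logical_usage(2, 1, 7, 1))  # (16, 2)  -> 16.0/32 CPUs, 2.0/4 GPUs
print(logical_usage(1, 2, 7, 1))  # (15, 2)  -> 15.0/32 CPUs, 2.0/4 GPUs in the output below
print(logical_usage(2, 2, 7, 1))  # (30, 4)  -> 30.0/32 CPUs, 4.0/4 GPUs in the output below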
Here’s the run output of python mwe.py with tune workers = 2, dataloader workers = 7, CPUs = 7, GPUs = 1 (which still produces the same warning, lol):

Run Output

2023-07-05 19:56:11,115	INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-05 19:56:20,003	INFO tune.py:218 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
/home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/ray/tune/experiment/experiment.py:170: UserWarning: The `local_dir` argument of `Experiment is deprecated. Use `storage_path` or set the `TUNE_RESULT_DIR` environment variable instead.
  warnings.warn(
(RayTrainWorker pid=19358) 2023-07-05 19:56:26,425	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=19358) GPU available: True (cuda), used: True
(RayTrainWorker pid=19358) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=19358) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=19358) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) 2023-07-05 19:56:32,289	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19361) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
(RayTrainWorker pid=19358) 
(RayTrainWorker pid=19358)   | Name | Type    | Params
(RayTrainWorker pid=19358) ---------------------------------
(RayTrainWorker pid=19358) 0 | loss | MSELoss | 0     
(RayTrainWorker pid=19358) 1 | fc1  | Linear  | 55    
(RayTrainWorker pid=19358) 2 | fc2  | Linear  | 60    
(RayTrainWorker pid=19358) ---------------------------------
(RayTrainWorker pid=19358) 115       Trainable params
(RayTrainWorker pid=19358) 0         Non-trainable params
(RayTrainWorker pid=19358) 115       Total params
(RayTrainWorker pid=19358) 0.000     Total estimated model params size (MB)
(RayTrainWorker pid=19361) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=19361)   warnings.warn(_create_warning_msg(
(RayTrainWorker pid=19358) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=19358)   warning_cache.warn(
(RayTrainWorker pid=19358)   warning_cache.warn(
(RayTrainWorker pid=19358)   warning_cache.warn(
(RayTrainWorker pid=19358)   warning_cache.warn(
(RayTrainWorker pid=19695) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
(RayTrainWorker pid=19694) 
(RayTrainWorker pid=19694)   | Name | Type    | Params
(RayTrainWorker pid=19694) ---------------------------------
(RayTrainWorker pid=19694) 0 | loss | MSELoss | 0     
(RayTrainWorker pid=19694) 1 | fc1  | Linear  | 55    
(RayTrainWorker pid=19694) 2 | fc2  | Linear  | 60    
(RayTrainWorker pid=19694) ---------------------------------
(RayTrainWorker pid=19694) 115       Trainable params
(RayTrainWorker pid=19694) 0         Non-trainable params
(RayTrainWorker pid=19694) 115       Total params
(RayTrainWorker pid=19694) 0.000     Total estimated model params size (MB)
(RayTrainWorker pid=19695) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=19695)   warnings.warn(_create_warning_msg(
(RayTrainWorker pid=19694) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=19694)   warning_cache.warn(
(RayTrainWorker pid=19694)   warning_cache.warn(
(RayTrainWorker pid=19694)   warning_cache.warn(
(RayTrainWorker pid=19694)   warning_cache.warn(
== Status ==
Current time: 2023-07-05 19:56:22 (running for 00:00:02.55)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 15.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 PENDING, 1 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | PENDING  |                      |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:28 (running for 00:00:08.00)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:33 (running for 00:00:13.00)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:38 (running for 00:00:18.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:43 (running for 00:00:23.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:48 (running for 00:00:28.02)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:53 (running for 00:00:33.02)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

== Status ==
Current time: 2023-07-05 19:56:58 (running for 00:00:38.03)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |
|                              |          |                      |    dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |
+------------------------------+----------+----------------------+------------------------+

Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-01
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 400
time_since_restore: 38.75692582130432
time_this_iter_s: 38.75692582130432
time_total_s: 38.75692582130432
timestamp: 1688612220
train/loss: 0.10777071118354797
training_iteration: 1
trial_id: af083_00000
val/loss: 0.10647716373205185

== Status ==
Current time: 2023-07-05 19:57:06 (running for 00:00:46.32)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.10647716373205185
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00000 with val/loss=0.10647716373205185 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0003727097483271039}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |      1 |          38.7569 |     0.107771 |   0.106477 |       0 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |        |                  |              |            |         |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

== Status ==
Current time: 2023-07-05 19:57:11 (running for 00:00:51.33)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.10647716373205185
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00000 with val/loss=0.10647716373205185 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0003727097483271039}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name                   | status   | loc                  |   lightning_config/_mo |   iter |   total time (s) |   train/loss |   val/loss |   epoch |
|                              |          |                      |    dule_init_config/lr |        |                  |              |            |         |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING  | 131.179.80.122:18992 |             0.00037271 |      1 |          38.7569 |     0.107771 |   0.106477 |       0 |
| LightningTrainer_af083_00001 | RUNNING  | 131.179.80.122:19360 |              0.0011464 |        |                  |              |            |         |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-11
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 400
time_since_restore: 43.45012712478638
time_this_iter_s: 43.45012712478638
time_total_s: 43.45012712478638
timestamp: 1688612230
train/loss: 0.09639785438776016
training_iteration: 1
trial_id: af083_00001
val/loss: 0.09315478801727295

Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-15
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 800
time_since_restore: 53.415080070495605
time_this_iter_s: 14.658154249191284
time_total_s: 53.415080070495605
timestamp: 1688612235
train/loss: 0.08431793004274368
training_iteration: 2
trial_id: af083_00000
val/loss: 0.09924431890249252

== Status ==
Current time: 2023-07-05 19:57:21 (running for 00:01:00.98)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09924431890249252 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09315478801727295 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 2 | 53.4151 | 0.0843179 | 0.0992443 | 1 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 1 | 43.4501 | 0.0963979 | 0.0931548 | 0 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-25
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 800
time_since_restore: 57.43458795547485
time_this_iter_s: 13.984460830688477
time_total_s: 57.43458795547485
timestamp: 1688612244
train/loss: 0.07889335602521896
training_iteration: 2
trial_id: af083_00001
val/loss: 0.09071239084005356

Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-30
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 1200
time_since_restore: 67.67485308647156
time_this_iter_s: 14.259773015975952
time_total_s: 67.67485308647156
timestamp: 1688612249
train/loss: 0.1051112711429596
training_iteration: 3
trial_id: af083_00000
val/loss: 0.09574979543685913

== Status ==
Current time: 2023-07-05 19:57:30 (running for 00:01:10.23)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09071239084005356 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 3 | 67.6749 | 0.105111 | 0.0957498 | 2 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 2 | 57.4346 | 0.0788934 | 0.0907124 | 1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

== Status ==
Current time: 2023-07-05 19:57:35 (running for 00:01:15.24)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09071239084005356 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 3 | 67.6749 | 0.105111 | 0.0957498 | 2 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 2 | 57.4346 | 0.0788934 | 0.0907124 | 1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-39
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 1200
time_since_restore: 71.14706802368164
time_this_iter_s: 13.712480068206787
time_total_s: 71.14706802368164
timestamp: 1688612258
train/loss: 0.08773250132799149
training_iteration: 3
trial_id: af083_00001
val/loss: 0.08743518590927124

Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-43
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 1600
time_since_restore: 81.42374658584595
time_this_iter_s: 13.74889349937439
time_total_s: 81.42374658584595
timestamp: 1688612263
train/loss: 0.09925425052642822
training_iteration: 4
trial_id: af083_00000
val/loss: 0.0947776809334755

(RayTrainWorker pid=19358) Trainer.fit stopped: max_epochs=5 reached.
== Status ==
Current time: 2023-07-05 19:57:44 (running for 00:01:23.98)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.0947776809334755 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08743518590927124 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 3 | 71.1471 | 0.0877325 | 0.0874352 | 2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

== Status ==
Current time: 2023-07-05 19:57:49 (running for 00:01:28.99)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.0947776809334755 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08743518590927124 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 3 | 71.1471 | 0.0877325 | 0.0874352 | 2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-53
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 1600
time_since_restore: 85.29977869987488
time_this_iter_s: 14.152710676193237
time_total_s: 85.29977869987488
timestamp: 1688612273
train/loss: 0.09886045008897781
training_iteration: 4
trial_id: af083_00001
val/loss: 0.08797292411327362

== Status ==
Current time: 2023-07-05 19:57:58 (running for 00:01:38.31)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08797292411327362 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 4 | 85.2998 | 0.0988605 | 0.0879729 | 3 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-58
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 2000
time_since_restore: 95.820796251297
time_this_iter_s: 14.39704966545105
time_total_s: 95.820796251297
timestamp: 1688612277
train/loss: 0.09550446271896362
training_iteration: 5
trial_id: af083_00000
val/loss: 0.09299320727586746

Trial LightningTrainer_af083_00000 completed.
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-58-03
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 2000
time_since_restore: 95.35388588905334
time_this_iter_s: 10.054107189178467
time_total_s: 95.35388588905334
timestamp: 1688612283
train/loss: 0.09061837941408157
training_iteration: 5
trial_id: af083_00001
val/loss: 0.08739878237247467

Trial LightningTrainer_af083_00001 completed.
2023-07-05 19:58:03,409 INFO tune.py:945 -- Total run time: 103.41 seconds (103.37 seconds for the tuning loop).
== Status ==
Current time: 2023-07-05 19:58:03 (running for 00:01:43.37)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 15.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08739878237247467 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 RUNNING, 1 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 5 | 95.3539 | 0.0906184 | 0.0873988 | 4 |
| LightningTrainer_af083_00000 | TERMINATED | 131.179.80.122:18992 | 0.00037271 | 5 | 95.8208 | 0.0955045 | 0.0929932 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

== Status ==
Current time: 2023-07-05 19:58:03 (running for 00:01:43.38)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08739878237247467 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | TERMINATED | 131.179.80.122:18992 | 0.00037271 | 5 | 95.8208 | 0.0955045 | 0.0929932 | 4 |
| LightningTrainer_af083_00001 | TERMINATED | 131.179.80.122:19360 | 0.0011464 | 5 | 95.3539 | 0.0906184 | 0.0873988 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+

(RayTrainWorker pid=19694) Trainer.fit stopped: max_epochs=5 reached.

You'll see it requests 2/4 GPUs and 2*7 + 1 = 15 CPUs for the first trial (two workers at 7 CPUs each, plus one CPU for coordination), so both trials together occupy 30/32 CPUs and 4/4 GPUs.
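For reference, here is a minimal sketch (my reconstruction, not the exact MWE code) of a ScalingConfig that produces that per-trial footprint, assuming 2 workers with 7 CPUs and 1 GPU each plus the default 1-CPU trainer actor:

from ray.air.config import ScalingConfig

# Per trial: 2 training workers * (7 CPUs + 1 GPU) + 1 CPU for the
# coordinating trainer actor = 15 CPUs and 2 GPUs, which matches the
# "Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs" lines while both
# trials are running.
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 7, "GPU": 1},
)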

I also tried upping the CPUs to 8 and 9, but it made no difference; the warning still said I should create at most 2 DataLoader workers.
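If it helps with debugging: as far as I can tell, the DataLoader warning bases its suggested worker count on the CPUs the process is actually allowed to use rather than the machine total, so a quick check (a hypothetical diagnostic, not part of the MWE) would be:

import os

# Total CPUs on the machine.
print("os.cpu_count():", os.cpu_count())

# CPUs this process is actually allowed to run on (Linux only).
# If the affinity set is tiny (e.g. 2 CPUs), that would explain the
# "suggested max number of worker in current system is 2" warning.
if hasattr(os, "sched_getaffinity"):
    affinity = os.sched_getaffinity(0)
    print("sched_getaffinity(0):", len(affinity), "CPUs ->", sorted(affinity))

Running this both directly and through guild should show whether the Guild-launched process sees a restricted CPU set.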