Run Output
2023-07-05 19:40:12,958 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-05 19:40:22,037 INFO tune.py:218 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
/home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/ray/tune/experiment/experiment.py:170: UserWarning: The `local_dir` argument of `Experiment is deprecated. Use `storage_path` or set the `TUNE_RESULT_DIR` environment variable instead.
warnings.warn(
(RayTrainWorker pid=28498) 2023-07-05 19:40:27,230 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=28498) GPU available: True (cuda), used: True
(RayTrainWorker pid=28498) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=28498) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=28498) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28498) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
(RayTrainWorker pid=28498)
(RayTrainWorker pid=28498) | Name | Type | Params
(RayTrainWorker pid=28498) ---------------------------------
(RayTrainWorker pid=28498) 0 | loss | MSELoss | 0
(RayTrainWorker pid=28498) 1 | fc1 | Linear | 55
(RayTrainWorker pid=28498) 2 | fc2 | Linear | 60
(RayTrainWorker pid=28498) ---------------------------------
(RayTrainWorker pid=28498) 115 Trainable params
(RayTrainWorker pid=28498) 0 Non-trainable params
(RayTrainWorker pid=28498) 115 Total params
(RayTrainWorker pid=28498) 0.000 Total estimated model params size (MB)
(RayTrainWorker pid=28498) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=28498) warnings.warn(_create_warning_msg(
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28754) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=28498) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=28498) warning_cache.warn(
(RayTrainWorker pid=28754) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(RayTrainWorker pid=28754)
(RayTrainWorker pid=28754) | Name | Type | Params
(RayTrainWorker pid=28754) ---------------------------------
(RayTrainWorker pid=28754) 0 | loss | MSELoss | 0
(RayTrainWorker pid=28754) 1 | fc1 | Linear | 55
(RayTrainWorker pid=28754) 2 | fc2 | Linear | 60
(RayTrainWorker pid=28754) ---------------------------------
(RayTrainWorker pid=28754) 115 Trainable params
(RayTrainWorker pid=28754) 0 Non-trainable params
(RayTrainWorker pid=28754) 115 Total params
(RayTrainWorker pid=28754) 0.000 Total estimated model params size (MB)
(RayTrainWorker pid=28754) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=28754) warnings.warn(_create_warning_msg(
(RayTrainWorker pid=28754) warning_cache.warn(
(RayTrainWorker pid=28754) warning_cache.warn(
== Status ==
Current time: 2023-07-05 19:40:24 (running for 00:00:02.58)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 8.0/32 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 PENDING, 1 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 |
| LightningTrainer_740a3_00001 | PENDING | | 0.000597133 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:40:29 (running for 00:00:07.60)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:40:34 (running for 00:00:12.60)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:40:39 (running for 00:00:17.61)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:40:44 (running for 00:00:22.61)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 |
+------------------------------+----------+----------------------+------------------------+
Result for LightningTrainer_740a3_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-40-48
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 28144
should_checkpoint: true
step: 800
time_since_restore: 23.41086745262146
time_this_iter_s: 23.41086745262146
time_total_s: 23.41086745262146
timestamp: 1688611247
train/loss: 0.13250067830085754
training_iteration: 1
trial_id: 740a3_00000
val/loss: 0.11762388795614243
== Status ==
Current time: 2023-07-05 19:40:53 (running for 00:00:31.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.11762388795614243
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.11762388795614243 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 | 1 | 23.4109 | 0.132501 | 0.117624 | 0 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 | | | | | |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_740a3_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-40-53
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 28500
should_checkpoint: true
step: 800
time_since_restore: 25.342666387557983
time_this_iter_s: 25.342666387557983
time_total_s: 25.342666387557983
timestamp: 1688611253
train/loss: 0.10083736479282379
training_iteration: 1
trial_id: 740a3_00001
val/loss: 0.1018548458814621
Result for LightningTrainer_740a3_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-40-57
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 28144
should_checkpoint: true
step: 1600
time_since_restore: 32.800697326660156
time_this_iter_s: 9.389829874038696
time_total_s: 32.800697326660156
timestamp: 1688611257
train/loss: 0.08113399147987366
training_iteration: 2
trial_id: 740a3_00000
val/loss: 0.10029216855764389
== Status ==
Current time: 2023-07-05 19:41:02 (running for 00:00:40.40)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.10029216855764389 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.10029216855764389 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 | 2 | 32.8007 | 0.081134 | 0.100292 | 1 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 | 1 | 25.3427 | 0.100837 | 0.101855 | 0 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_740a3_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-41-03
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 28500
should_checkpoint: true
step: 1600
time_since_restore: 34.708067655563354
time_this_iter_s: 9.365401268005371
time_total_s: 34.708067655563354
timestamp: 1688611263
train/loss: 0.07746944576501846
training_iteration: 2
trial_id: 740a3_00001
val/loss: 0.09756060689687729
Result for LightningTrainer_740a3_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-41-07
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 28144
should_checkpoint: true
step: 2400
time_since_restore: 42.57479667663574
time_this_iter_s: 9.774099349975586
time_total_s: 42.57479667663574
timestamp: 1688611267
train/loss: 0.09988719969987869
training_iteration: 3
trial_id: 740a3_00000
val/loss: 0.0968434065580368
== Status ==
Current time: 2023-07-05 19:41:12 (running for 00:00:50.18)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00000 with val/loss=0.0968434065580368 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.00023670587019652509}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 | 3 | 42.5748 | 0.0998872 | 0.0968434 | 2 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 | 2 | 34.7081 | 0.0774694 | 0.0975606 | 1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_740a3_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-41-12
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 28500
should_checkpoint: true
step: 2400
time_since_restore: 44.249186754226685
time_this_iter_s: 9.54111909866333
time_total_s: 44.249186754226685
timestamp: 1688611272
train/loss: 0.08657821267843246
training_iteration: 3
trial_id: 740a3_00001
val/loss: 0.09475232660770416
Result for LightningTrainer_740a3_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-41-16
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 28144
should_checkpoint: true
step: 3200
time_since_restore: 52.18841528892517
time_this_iter_s: 9.613618612289429
time_total_s: 52.18841528892517
timestamp: 1688611276
train/loss: 0.09277255833148956
training_iteration: 4
trial_id: 740a3_00000
val/loss: 0.09569665789604187
(RayTrainWorker pid=28498) `Trainer.fit` stopped: `max_epochs=5` reached.
2023-07-05 19:41:29,815 INFO tune.py:945 -- Total run time: 67.78 seconds (67.74 seconds for the tuning loop).
== Status ==
Current time: 2023-07-05 19:41:21 (running for 00:00:59.78)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.09569665789604187 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 16.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09475232660770416 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | RUNNING | 131.179.80.122:28144 | 0.000236706 | 4 | 52.1884 | 0.0927726 | 0.0956967 | 3 |
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 | 3 | 44.2492 | 0.0865782 | 0.0947523 | 2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_740a3_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-41-22
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 28500
should_checkpoint: true
step: 3200
time_since_restore: 53.791813373565674
time_this_iter_s: 9.54262661933899
time_total_s: 53.791813373565674
timestamp: 1688611282
train/loss: 0.09039577096700668
training_iteration: 4
trial_id: 740a3_00001
val/loss: 0.09388325363397598
Result for LightningTrainer_740a3_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-41-26
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 28144
should_checkpoint: true
step: 4000
time_since_restore: 61.73058104515076
time_this_iter_s: 9.542165756225586
time_total_s: 61.73058104515076
timestamp: 1688611286
train/loss: 0.09823152422904968
training_iteration: 5
trial_id: 740a3_00000
val/loss: 0.09352872520685196
Trial LightningTrainer_740a3_00000 completed.
Result for LightningTrainer_740a3_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-41-29
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 28500
should_checkpoint: true
step: 4000
time_since_restore: 61.19364666938782
time_this_iter_s: 7.4018332958221436
time_total_s: 61.19364666938782
timestamp: 1688611289
train/loss: 0.093507781624794
training_iteration: 5
trial_id: 740a3_00001
val/loss: 0.09201841056346893
Trial LightningTrainer_740a3_00001 completed.
== Status ==
Current time: 2023-07-05 19:41:29 (running for 00:01:07.73)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09478995576500893 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 8.0/32 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09201841056346893 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 RUNNING, 1 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00001 | RUNNING | 131.179.80.122:28500 | 0.000597133 | 5 | 61.1936 | 0.0935078 | 0.0920184 | 4 |
| LightningTrainer_740a3_00000 | TERMINATED | 131.179.80.122:28144 | 0.000236706 | 5 | 61.7306 | 0.0982315 | 0.0935287 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
== Status ==
Current time: 2023-07-05 19:41:29 (running for 00:01:07.75)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09478995576500893 | Iter 2.000: -0.09892638772726059 | Iter 1.000: -0.10973936691880226
Logical resource usage: 0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: 740a3_00001 with val/loss=0.09201841056346893 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0005971329502182427}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_740a3_00000 | TERMINATED | 131.179.80.122:28144 | 0.000236706 | 5 | 61.7306 | 0.0982315 | 0.0935287 | 4 |
| LightningTrainer_740a3_00001 | TERMINATED | 131.179.80.122:28500 | 0.000597133 | 5 | 61.1936 | 0.0935078 | 0.0920184 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
(RayTrainWorker pid=28754) `Trainer.fit` stopped: `max_epochs=5` reached.
The resource issue I was concerned about revolves less around the reported resource usage and more around that DataLoader warning. In particular, it’s strange that with only 8/32 CPUs in use I should surely have room for 7 DataLoader workers, yet the warning still appears. It’s even more puzzling when more resources are being used/requested.
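For context, as far as I can tell PyTorch derives that "suggested max" from the CPU affinity of the worker process rather than from Ray's logical resource accounting, so the 2 may reflect how the process was launched rather than how many CPUs are free. A quick way to compare the two, assuming a Linux host (`os.sched_getaffinity` is Linux-only):

```python
import os

# Total CPUs on the machine vs. CPUs this process is allowed to schedule on.
# The DataLoader warning appears to be based on the latter.
print("os.cpu_count():           ", os.cpu_count())
print("len(os.sched_getaffinity):", len(os.sched_getaffinity(0)))
```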
I believe the progression of resources you’re seeing there is intended behavior from Ray. Essentially, each group of num_workers (1 here) is assigned to a trial with the requested resources (7 CPUs, 1 GPU). Ray picks up the first trial and assigns it 1 GPU and 7 CPUs, plus 1 CPU for a head process that manages everything. Then it sees there are enough resources to run the second trial in parallel, so it devotes another 1 GPU + 8 CPUs, resulting in 16/32 CPUs and 2/4 GPUs. Once tuning has ended and all trials are finished, Ray no longer needs any resources: 0/32 and 0/4.
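For reference, here is a minimal sketch of that per-trial request, assuming Ray 2.x AIR APIs and that mwe.py asks for 1 train worker per trial (the actual config in mwe.py may differ):

```python
from ray.air.config import ScalingConfig

# Per trial: 1 Ray Train worker with 7 CPUs + 1 GPU, plus 1 CPU reserved for the
# trial's driver/"head" process -> 8 CPUs and 1 GPU per trial. Two concurrent
# trials account for the 16.0/32 CPUs and 2.0/4 GPUs shown in the status blocks.
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    resources_per_worker={"CPU": 7, "GPU": 1},
    trainer_resources={"CPU": 1},  # 1 CPU for the trainable itself (the default)
)
```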
Here’s the run output of `python mwe.py` where tune workers = 2, dataloader workers = 7, cpus = 7, gpus = 1 (which still has the same warning, lol):
Run Output
2023-07-05 19:56:11,115 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-05 19:56:20,003 INFO tune.py:218 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
/home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/ray/tune/experiment/experiment.py:170: UserWarning: The `local_dir` argument of `Experiment is deprecated. Use `storage_path` or set the `TUNE_RESULT_DIR` environment variable instead.
warnings.warn(
(RayTrainWorker pid=19358) 2023-07-05 19:56:26,425 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=19358) GPU available: True (cuda), used: True
(RayTrainWorker pid=19358) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=19358) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=19358) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) 2023-07-05 19:56:32,289 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19694) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=19361) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
(RayTrainWorker pid=19358)
(RayTrainWorker pid=19358) | Name | Type | Params
(RayTrainWorker pid=19358) ---------------------------------
(RayTrainWorker pid=19358) 0 | loss | MSELoss | 0
(RayTrainWorker pid=19358) 1 | fc1 | Linear | 55
(RayTrainWorker pid=19358) 2 | fc2 | Linear | 60
(RayTrainWorker pid=19358) ---------------------------------
(RayTrainWorker pid=19358) 115 Trainable params
(RayTrainWorker pid=19358) 0 Non-trainable params
(RayTrainWorker pid=19358) 115 Total params
(RayTrainWorker pid=19358) 0.000 Total estimated model params size (MB)
(RayTrainWorker pid=19361) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=19361) warnings.warn(_create_warning_msg(
(RayTrainWorker pid=19358) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=19358) warning_cache.warn(
(RayTrainWorker pid=19358) warning_cache.warn(
(RayTrainWorker pid=19358) warning_cache.warn(
(RayTrainWorker pid=19358) warning_cache.warn(
(RayTrainWorker pid=19695) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
(RayTrainWorker pid=19694)
(RayTrainWorker pid=19694) | Name | Type | Params
(RayTrainWorker pid=19694) ---------------------------------
(RayTrainWorker pid=19694) 0 | loss | MSELoss | 0
(RayTrainWorker pid=19694) 1 | fc1 | Linear | 55
(RayTrainWorker pid=19694) 2 | fc2 | Linear | 60
(RayTrainWorker pid=19694) ---------------------------------
(RayTrainWorker pid=19694) 115 Trainable params
(RayTrainWorker pid=19694) 0 Non-trainable params
(RayTrainWorker pid=19694) 115 Total params
(RayTrainWorker pid=19694) 0.000 Total estimated model params size (MB)
(RayTrainWorker pid=19695) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 7 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
(RayTrainWorker pid=19695) warnings.warn(_create_warning_msg(
(RayTrainWorker pid=19694) /home/davina/mambaforge/envs/ap/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:432: PossibleUserWarning: It is recommended to use `self.log('val/loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
(RayTrainWorker pid=19694) warning_cache.warn(
(RayTrainWorker pid=19694) warning_cache.warn(
(RayTrainWorker pid=19694) warning_cache.warn(
(RayTrainWorker pid=19694) warning_cache.warn(
== Status ==
Current time: 2023-07-05 19:56:22 (running for 00:00:02.55)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 15.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 PENDING, 1 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | PENDING | | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:28 (running for 00:00:08.00)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:33 (running for 00:00:13.00)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:38 (running for 00:00:18.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:43 (running for 00:00:23.01)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:48 (running for 00:00:28.02)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:53 (running for 00:00:33.02)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
== Status ==
Current time: 2023-07-05 19:56:58 (running for 00:00:38.03)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+
| Trial name | status | loc | lightning_config/_mo |
| | | | dule_init_config/lr |
|------------------------------+----------+----------------------+------------------------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 |
+------------------------------+----------+----------------------+------------------------+
Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-01
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 400
time_since_restore: 38.75692582130432
time_this_iter_s: 38.75692582130432
time_total_s: 38.75692582130432
timestamp: 1688612220
train/loss: 0.10777071118354797
training_iteration: 1
trial_id: af083_00000
val/loss: 0.10647716373205185
== Status ==
Current time: 2023-07-05 19:57:06 (running for 00:00:46.32)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.10647716373205185
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00000 with val/loss=0.10647716373205185 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0003727097483271039}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 1 | 38.7569 | 0.107771 | 0.106477 | 0 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | | | | | |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
== Status ==
Current time: 2023-07-05 19:57:11 (running for 00:00:51.33)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.10647716373205185
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00000 with val/loss=0.10647716373205185 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0003727097483271039}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 1 | 38.7569 | 0.107771 | 0.106477 | 0 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | | | | | |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-11
done: false
epoch: 0
hostname: lambda2
iterations_since_restore: 1
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 400
time_since_restore: 43.45012712478638
time_this_iter_s: 43.45012712478638
time_total_s: 43.45012712478638
timestamp: 1688612230
train/loss: 0.09639785438776016
training_iteration: 1
trial_id: af083_00001
val/loss: 0.09315478801727295
Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-15
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 800
time_since_restore: 53.415080070495605
time_this_iter_s: 14.658154249191284
time_total_s: 53.415080070495605
timestamp: 1688612235
train/loss: 0.08431793004274368
training_iteration: 2
trial_id: af083_00000
val/loss: 0.09924431890249252
== Status ==
Current time: 2023-07-05 19:57:21 (running for 00:01:00.98)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09924431890249252 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09315478801727295 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 2 | 53.4151 | 0.0843179 | 0.0992443 | 1 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 1 | 43.4501 | 0.0963979 | 0.0931548 | 0 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-25
done: false
epoch: 1
hostname: lambda2
iterations_since_restore: 2
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 800
time_since_restore: 57.43458795547485
time_this_iter_s: 13.984460830688477
time_total_s: 57.43458795547485
timestamp: 1688612244
train/loss: 0.07889335602521896
training_iteration: 2
trial_id: af083_00001
val/loss: 0.09071239084005356
Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-30
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 1200
time_since_restore: 67.67485308647156
time_this_iter_s: 14.259773015975952
time_total_s: 67.67485308647156
timestamp: 1688612249
train/loss: 0.1051112711429596
training_iteration: 3
trial_id: af083_00000
val/loss: 0.09574979543685913
== Status ==
Current time: 2023-07-05 19:57:30 (running for 00:01:10.23)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09071239084005356 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 3 | 67.6749 | 0.105111 | 0.0957498 | 2 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 2 | 57.4346 | 0.0788934 | 0.0907124 | 1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
== Status ==
Current time: 2023-07-05 19:57:35 (running for 00:01:15.24)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.09071239084005356 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 3 | 67.6749 | 0.105111 | 0.0957498 | 2 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 2 | 57.4346 | 0.0788934 | 0.0907124 | 1 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-39
done: false
epoch: 2
hostname: lambda2
iterations_since_restore: 3
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 1200
time_since_restore: 71.14706802368164
time_this_iter_s: 13.712480068206787
time_total_s: 71.14706802368164
timestamp: 1688612258
train/loss: 0.08773250132799149
training_iteration: 3
trial_id: af083_00001
val/loss: 0.08743518590927124
Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-43
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 1600
time_since_restore: 81.42374658584595
time_this_iter_s: 13.74889349937439
time_total_s: 81.42374658584595
timestamp: 1688612263
train/loss: 0.09925425052642822
training_iteration: 4
trial_id: af083_00000
val/loss: 0.0947776809334755
(RayTrainWorker pid=19358) `Trainer.fit` stopped: `max_epochs=5` reached.
== Status ==
Current time: 2023-07-05 19:57:44 (running for 00:01:23.98)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.0947776809334755 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08743518590927124 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 3 | 71.1471 | 0.0877325 | 0.0874352 | 2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
== Status ==
Current time: 2023-07-05 19:57:49 (running for 00:01:28.99)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.0947776809334755 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08743518590927124 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 3 | 71.1471 | 0.0877325 | 0.0874352 | 2 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-57-53
done: false
epoch: 3
hostname: lambda2
iterations_since_restore: 4
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 1600
time_since_restore: 85.29977869987488
time_this_iter_s: 14.152710676193237
time_total_s: 85.29977869987488
timestamp: 1688612273
train/loss: 0.09886045008897781
training_iteration: 4
trial_id: af083_00001
val/loss: 0.08797292411327362
== Status ==
Current time: 2023-07-05 19:57:58 (running for 00:01:38.31)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 30.0/32 CPUs, 4.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08797292411327362 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 RUNNING)
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | RUNNING | 131.179.80.122:18992 | 0.00037271 | 4 | 81.4237 | 0.0992543 | 0.0947777 | 3 |
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 4 | 85.2998 | 0.0988605 | 0.0879729 | 3 |
+------------------------------+----------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
Result for LightningTrainer_af083_00000:
_report_on: train_epoch_end
date: 2023-07-05_19-57-58
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 18992
should_checkpoint: true
step: 2000
time_since_restore: 95.820796251297
time_this_iter_s: 14.39704966545105
time_total_s: 95.820796251297
timestamp: 1688612277
train/loss: 0.09550446271896362
training_iteration: 5
trial_id: af083_00000
val/loss: 0.09299320727586746
Trial LightningTrainer_af083_00000 completed.
Result for LightningTrainer_af083_00001:
_report_on: train_epoch_end
date: 2023-07-05_19-58-03
done: true
epoch: 4
hostname: lambda2
iterations_since_restore: 5
node_ip: 131.179.80.122
pid: 19360
should_checkpoint: true
step: 2000
time_since_restore: 95.35388588905334
time_this_iter_s: 10.054107189178467
time_total_s: 95.35388588905334
timestamp: 1688612283
train/loss: 0.09061837941408157
training_iteration: 5
trial_id: af083_00001
val/loss: 0.08739878237247467
Trial LightningTrainer_af083_00001 completed.
2023-07-05 19:58:03,409 INFO tune.py:945 -- Total run time: 103.41 seconds (103.37 seconds for the tuning loop).
== Status ==
Current time: 2023-07-05 19:58:03 (running for 00:01:43.37)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 15.0/32 CPUs, 2.0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08739878237247467 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (1 RUNNING, 1 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00001 | RUNNING | 131.179.80.122:19360 | 0.0011464 | 5 | 95.3539 | 0.0906184 | 0.0873988 | 4 |
| LightningTrainer_af083_00000 | TERMINATED | 131.179.80.122:18992 | 0.00037271 | 5 | 95.8208 | 0.0955045 | 0.0929932 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
== Status ==
Current time: 2023-07-05 19:58:03 (running for 00:01:43.38)
Using AsyncHyperBand: num_stopped=2
Bracket: Iter 4.000: -0.09137530252337456 | Iter 2.000: -0.09497835487127304 | Iter 1.000: -0.0998159758746624
Logical resource usage: 0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:RTX)
Current best trial: af083_00001 with val/loss=0.08739878237247467 and parameters={'lightning_config': {'_module_init_config': {'lr': 0.0011463981813052714}, '_trainer_init_config': {}, '_trainer_fit_params': {}, '_ddp_strategy_config': {}, '_model_checkpoint_config': {}}}
Result logdir: /home/davina/Private/repos/autopopulus/issue-resolution/my.guild.ai-1050-guild-interfering-with-distributed-resource-allocation-pytorch-lightning-ray-tune/ray_results/mwe
Number of trials: 2/2 (2 TERMINATED)
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
| Trial name | status | loc | lightning_config/_mo | iter | total time (s) | train/loss | val/loss | epoch |
| | | | dule_init_config/lr | | | | | |
|------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------|
| LightningTrainer_af083_00000 | TERMINATED | 131.179.80.122:18992 | 0.00037271 | 5 | 95.8208 | 0.0955045 | 0.0929932 | 4 |
| LightningTrainer_af083_00001 | TERMINATED | 131.179.80.122:19360 | 0.0011464 | 5 | 95.3539 | 0.0906184 | 0.0873988 | 4 |
+------------------------------+------------+----------------------+------------------------+--------+------------------+--------------+------------+---------+
(RayTrainWorker pid=19694) `Trainer.fit` stopped: `max_epochs=5` reached.
You’ll see it requests 2/4 GPUs and 2*7 + 1 = 15 CPUs for the first trial.
I also tried upping the CPUs to 8 and 9 and it made no difference; it still complained that I should only create 2 dataloader workers.
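One workaround I’m considering (my own sketch, not something from the Ray or Lightning docs) is to size `num_workers` from the CPU affinity the worker process actually ends up with, rather than hard-coding 7:

```python
import os
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size=32, desired_workers=7):
    # os.sched_getaffinity is Linux-only; fall back to os.cpu_count() elsewhere.
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        available = os.cpu_count() or 1
    # Never ask for more DataLoader workers than the process can schedule on.
    return DataLoader(dataset, batch_size=batch_size,
                      num_workers=min(desired_workers, available))
```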