Guild.ai x Lightning CLI

(PyTorch) Lightning has moved on to a more sophisticated CLI based on argparse (Configure hyperparameters from the CLI — PyTorch Lightning 2.1.2 documentation), and I’m having some trouble combining the two.

I was able to “hack” everything together; the only obstacle is that the Lightning CLI expects commands like python path/to/main.py fit --model.lr=3e-4, and I see no way to emulate the subcommand “fit” in my guild.yml.

Ideally, Guild could simply import the config.yml that PyTorch Lightning expects (via flags-import and flags-dest) and transform it into the equivalent command-line arguments.

So given

seed_everything: 42
trainer:
    accelerator: cpu
    deterministic: True
    fast_dev_run: True
...

I can run python trainer.py fit --config config.yml
or directly python trainer.py fit --seed_everything 42 --trainer.accelerator cpu ...
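
For illustration, the transformation I have in mind is roughly this (just a sketch, not actual Guild behaviour):

# Sketch only: flatten a nested config dict into the dotted command-line
# arguments the Lightning CLI expects.
def to_cli_args(cfg, prefix=""):
    args = []
    for key, value in cfg.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            args.extend(to_cli_args(value, prefix=name + "."))
        else:
            args.extend([f"--{name}", str(value)])
    return args

# to_cli_args({"seed_everything": 42, "trainer": {"accelerator": "cpu"}})
# -> ["--seed_everything", "42", "--trainer.accelerator", "cpu"]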

In guild I tried the following:

test:
  main: trainer
  flags:
    seed_everything:
      default: 42
    trainer.accelerator:
      default: cpu
...

However, the subcommand fit can’t be modeled this way.
Any ideas on how to do this? I’d love to leverage Guild, for example, for running multiple trials.

Cheers,
Alessandro


I was able to build a “fix”. But I hope there is a less ugly way to solve it properly.

# run.py
import guild.ipy as guild

from models.nn import trainer

# Guild writes the (possibly nested) flag overrides into this dict.
guildparams = {}

# Default arguments for the Lightning CLI's fit subcommand.
args = {
    "fit": {
        "seed_everything": 42,
        "trainer": {
            "accelerator": "cpu",
            "deterministic": True,
            "fast_dev_run": False,
        },
        # ...
    }
}


def modify_dict(original_dict, modification_dict):
    for key, value in modification_dict.items():
        if isinstance(value, dict) and key in original_dict and isinstance(original_dict[key], dict):
            # If the value is a nested dictionary, recursively update it
            modify_dict(original_dict[key], value)
        else:
            # Update the value if the key exists, otherwise add the key-value pair
            original_dict[key] = value


if __name__ == "__main__":
    modify_dict(args, guildparams)
    run, return_val = guild.run(trainer.main, args)
# models/nn/trainer.py
from lightning.pytorch.cli import LightningCLI

from models.nn.data import DataModule
from models.nn.model import LightningModel


def main(args=None):
    cli = LightningCLI(LightningModel, DataModule, args=args, subclass_mode_model=True)
# guild.yml
test:
  main: run
  flags-dest: global:guildparams
  flags-import: all
  flags:
    fit.trainer.fast_dev_run: True

Given the run.py, trainer module, and guild.yml above, we can start a training run with guild run test and pass command-line arguments as expected, e.g., guild run test fit.seed_everything=[1,2,3] --force-flags.

Guild seems to convert the flattened input arguments into a nested dictionary. Given that, we can merge the overrides into args using modify_dict and then pass args to the LightningCLI via trainer.main(args). To record all arguments, we use guild.run instead of calling trainer.main(args) directly. Theoretically, one could replace the args dict with a config.yml.
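
To illustrate with assumed values: a flag such as fit.trainer.fast_dev_run=True arrives as a nested dict in guildparams, which modify_dict then merges into args:

# Example values only, showing what guildparams would hold after
# `guild run test fit.trainer.fast_dev_run=True --force-flags`.
guildparams = {"fit": {"trainer": {"fast_dev_run": True}}}
modify_dict(args, guildparams)
print(args["fit"]["trainer"]["fast_dev_run"])  # True; the other defaults are kept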

Edit: Another downside of this hack is that we end up with two runs: the test run and the main() run. The main() run doesn’t track any source code, and the test run doesn’t track the args. We can extend the hack by connecting these runs: run, return_val = guild.run(trainer.main, args, id=guild.runs().iloc[0].run.run.id). It works, but I am sure there must be a better option.

Taking a look at this now.


Have you tried using exec? You can use flag values with ${FLAG_NAME} in the exec expression.

If I’m following your issue here, I think the solution to this is:

trainer:
  exec: python trainer.py fit --model.lr=${model_lr}
  flags:
    model_lr:
      default: 3e-4
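
Flag values can then be overridden per run as usual, e.g.:

guild run trainer model_lr=1e-3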

Also, have you tried adding:

trainer:
  flags-dest: config:config.yml
  flags-import: all
  ...

Let me know if this solves the problem for you.


@Alessandro The wrapper you’ve created is a good fallback, but I agree it’d be far better if Guild supported the Lightning CLI cleanly. This would likely require a patch à la the existing argparse and click support.

Have you considered using a config file as your interim interface? This would at least get you out of using a Python wrapper with the ipy module and into config-file-based flag defs. You might still need a wrapper to map the config to command args, but it wouldn’t be Guild specific.
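
Something along these lines, for example (just a sketch, reusing the trainer module from your run.py):

# Sketch of a Guild-agnostic wrapper: forward a config file to the
# Lightning CLI's fit subcommand (module path taken from the earlier example).
import sys

from models.nn import trainer

if __name__ == "__main__":
    config_path = sys.argv[1] if len(sys.argv) > 1 else "config.yml"
    # Equivalent to: python trainer.py fit --config config.yml
    trainer.main(["fit", "--config", config_path])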

I realize this is all a pain if you’re more interested in doing actual work! I wish I had a better out-of-the-box approach for you.

Thank you! That solved it nicely after stepping through the LightningCLI codebase.

I refactored my code into:

# guild.yml
config:
  exec: python my_model/main.py fit --config config.yml
  sourcecode:
    - "my_model/*.py"
  flags-dest: config:config.yml
  flags-import: all
  flags:
    # flags you want to override
    trainer.fast_dev_run: True
    # ...

where config.yml contains additional flags that may be overridden by Guild, e.g.:

# config.yml
# basic configuration, e.g., overriding default parameters
trainer:
  accelerator: cpu
  deterministic: True
  fast_dev_run: False
# ...
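
Since flags-import: all turns every key in config.yml into a Guild flag, runs (including multiple trials) can be parameterized from the command line, for example (the second line assumes a GPU is available):

guild run config trainer.fast_dev_run=False
guild run config trainer.accelerator=[cpu,gpu]  # one trial per value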

Guild creates a config.yml in the run dir [Docs].
This can then be used by the main.py [Tutorial]:

# main.py
from lightning.pytorch.cli import LightningCLI

def main(args=None):
    LightningCLI(args=args, subclass_mode_model=True)

if __name__ == "__main__":
    main()

And finally, PyTorch Lightning creates the fully parameterized config.yml in, e.g., /path/to/run_dir/lightning_logs/version_0/config.yaml.

Beautiful. Thank you all!

Edit: I created a small sample repository at GitHub - AlessandroW/Guild.ai-x-PyTorch-Lightning: Sample implementation of orchestrating PyTorch Lightning models via its CLI managed by guild.ai


@Alessandro, thanks for the sample repository. It helped me a lot.

If it helps anyone, below I show how I could complement it with a dependency use case.

I tried to run a validation op with the training op as a requirement, but I kept getting argparse.ArgumentError: Validation failed: No action for key "operation:facornet:train" to check its value.

My Guild config was:

        - operation: facornet:train
          select: exp/checkpoints/.*\.(\d+)\.ckpt
          rename: exp/checkpoints/best.ckpt
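
For context, that fragment sits under the validation op’s requires list, roughly like this (the op name and surrounding keys are assumed; only the requires entries are from my config):

validate:
  # ... exec / flags-dest / flags-import as in the training op ...
  requires:
    - operation: facornet:train
      select: exp/checkpoints/.*\.(\d+)\.ckpt
      rename: exp/checkpoints/best.ckpt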

In the end, I solved it by subclassing LightningCLI, as below:

from lightning.pytorch.cli import LightningCLI


class MyLightningCLI(LightningCLI):
    def add_arguments_to_parser(self, parser):
        # Accept the extra key as a custom argument so that parsing no longer
        # fails with: No action for key "operation:facornet:train"
        parser.add_argument("--operation:facornet:train", default=None)
        return parser


def main(args=None):
    MyLightningCLI(args=args, subclass_mode_model=True)

What I couldn’t do was use select-max, but I am OK with that for now.