Flag not recognized: No module named --batch_size

Hi there,
out of the blue my guild script now claims:

guild: No module named --batch_size

batch_size is a flag, as defined in guild.yml:

          batch_size:
            default: 32

and it is also an argument of the script's argument parser:

        parser.add_argument("--batch_size", type=int, default=8)

I don’t understand the error message, as this worked before and it is not a syntax error; the run preview still shows the flag:

You are about to run model:train
  batch_size: 32

Do you have an idea what could have gone wrong?
I did not change the code, and I start the run with guild run train.

Previously I had a similar error, where the accelerator flag of the PyTorch Lightning Trainer wouldn’t be recognized (same error message).


Looks like the indentation of the flags specification might be wrong.
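For comparison, a correctly indented flags section in the operation-only guild.yml format would look roughly like this (the operation and flag names are taken from your output; the main module name is an assumption):

    train:
      main: train
      flags:
        batch_size:
          default: 32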

Thanks for replying. The configuration file is correct when compared against the file reference. I noticed that --batch_size is the first flag alphabetically; if I add another flag that sorts before it, the batch_size flag is recognized.

I was able to narrow it down further: if I specify the GPUs as a comma-separated string, the error occurs:

  # works
  default: 0
  # does not work
  default: 0,1
  # does not work either
  default: "0,1"

How should I encode this?

Have a look at the flag value decoding documentation and arg-split.

Note that guild has integrated support for choosing GPUs.
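For example (assuming a recent Guild version), the devices available to a run can be limited directly from the run command, without going through a flag:

$ guild run --gpus 0,1 train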

There are a few mysterious things here that I’m scratching my head over. Do you have a small example I can run to reproduce this?


I was able to resolve some of the issues:

  1. Use arg-switch: yes for switches. Omitting it caused errors that did not indicate the culprit, e.g.
          default: yes
          arg-switch: yes
  2. Use arg-split: yes for nargs arguments (see the combined sketch after this list), e.g.
          default: "50 1"
          arg-split: yes
  3. PyTorch Lightning’s (PL) ArgParser does tricky things: for multi-GPU flags it does not use nargs; instead you pass strings.
  4. It was difficult for me to find out about the arg-split option, and I’m still not sure whether my output is a string or an integer.
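As a combined sketch of how these options fit into a flag definition (the operation, module, and flag names here are placeholders, not from my project):

    train:
      main: train
      flags:
        milestones:            # argparse argument defined with nargs="+"
          default: "50 1"
          arg-split: yes       # passed as: --milestones 50 1
        use_swa:               # argparse switch defined with action="store_true"
          default: yes
          arg-switch: yes      # passed as: --use_swa (no value) when the flag is true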

In PL you can select a specific GPU with --gpus "3" and select the first three GPUs with --gpus 3. Two specific GPUs, e.g. GPUs 1 and 3, can be selected with --gpus "1,3". In the latest PL version they dropped the --gpus "3" option; instead you can use --gpus "3," to target a specific GPU.



Setup for a minimal example:

$ pip install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install pytorch-lightning lightning-bolts guildai


Directory structure:

| guild.yml
L test_guild
    | __init__.py
    L main.py

test_guild/main.py:


import os
from argparse import ArgumentParser

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl
from pl_bolts.datasets import DummyDataset

class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def training_step(self, batch, batch_idx):
        # --------------------------
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss)
        return loss
        # --------------------------

    def validation_step(self, batch, batch_idx):
        # --------------------------
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)
        # --------------------------

    def test_step(self, batch, batch_idx):
        # --------------------------
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('test_loss', loss)
        # --------------------------

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

if __name__ == "__main__":

    train = DummyDataset((1, 28, 28), (1,))
    train = DataLoader(train, batch_size=32)
    val = DummyDataset((1, 28, 28), (1,))
    val = DataLoader(val, batch_size=32)
    test = DummyDataset((1, 28, 28), (1,))
    test = DataLoader(test, batch_size=32)

    # init model
    ae = LitAutoEncoder()

    # Initialize a trainer
    parser = ArgumentParser()
    parser = pl.Trainer.add_argparse_args(parser)
    args = parser.parse_args()
    trainer = pl.Trainer.from_argparse_args(args)

    # Train the model ⚡
    trainer.fit(ae, train, val)


guild.yml:

- model: test-guild
  sourcecode:
    - "*.py"
  operations:
    main:
      main: test_guild.main
      flags:
        max_steps:
          default: 2000
        gpus:
          default: "1,3"

Now, python test_guild/main.py --gpus "1,3" --max_steps=2000 and guild run main are identical.
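As a usage sketch (assuming Guild's standard name=value syntax for flag assignments), the defaults can also be overridden per run:

$ guild run main max_steps=500 gpus="0,1"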

I really like the experiment tracking capabilities of Guild AI; getting PyTorch Lightning and Guild AI working together is sometimes tough, but it is worth it.

This thread covers a number of different issues. One that’s highlighted by @Alessandro’s latest post (thank you for the terrific detail!) is that Guild wasn’t handling some of the PyTorch Lightning CLI args on flags import. This has been fixed and will be available in the next release (0.7.5). You can test the functionality in the pre-release version 0.7.5.dev2. This example shows the new behavior.

That looks awesome! Great work!
