Hi,
I was able to resolve the issues:
- Use

  arg-switch: yes

  for switches. Getting this wrong caused errors that gave no hint of the culprit, e.g.

  flags:
    deterministic:
      default: yes
      arg-switch: yes
- Use

  arg-split: yes

  for nargs-style flags, e.g.

  pos_weight:
    default: "50 1"
    arg-split: yes
- PyTorch Lightning's (PL) ArgParser does tricky things: for multi-GPU flags it does not use nargs; instead you pass strings.
- It was difficult for me to find the arg-split option, and I'm still not sure whether my output is a string or an integer. In PL you can select a specific GPU with --gpus "3" and the first three GPUs with --gpus 3. Two specific GPUs, e.g. GPUs 1 and 3, can be selected with --gpus "1,3". The latest PL version dropped the --gpus "3" form; instead you use --gpus "3," to target a specific GPU.
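To make the arg-switch behaviour concrete, here is a minimal argparse sketch (my own illustration, not Guild's code): as I understand it, with arg-switch: yes Guild passes the flag as a bare switch such as --deterministic rather than --deterministic=yes, which matches an argparse option declared with action="store_true".

```python
from argparse import ArgumentParser

# Sketch: a bare switch, as produced by a flag with arg-switch: yes.
parser = ArgumentParser()
parser.add_argument("--deterministic", action="store_true")

# Present -> True, absent -> False.
print(parser.parse_args(["--deterministic"]).deterministic)  # True
print(parser.parse_args([]).deterministic)                   # False
```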
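Likewise for arg-split, a minimal sketch (flag name taken from the pos_weight example above; again my own illustration): as I understand it, Guild splits the string default "50 1" on whitespace, so the script sees two separate argv tokens, matching an argparse option declared with nargs.

```python
from argparse import ArgumentParser

# Sketch: a multi-valued option, as fed by a flag with arg-split: yes.
parser = ArgumentParser()
parser.add_argument("--pos_weight", nargs="+", type=float)

# The split default "50 1" arrives as two tokens.
args = parser.parse_args(["--pos_weight", "50", "1"])
print(args.pos_weight)  # [50.0, 1.0]
```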
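The --gpus string forms can be summarised in a small helper. This is a rough sketch of the behaviour described above, not PL's actual parsing code:

```python
def parse_gpus(value):
    """Rough sketch of how a --gpus string can be interpreted:
    "3"   -> 3        (an int: use the first three GPUs)
    "1,3" -> [1, 3]   (a list: use GPUs 1 and 3)
    "3,"  -> [3]      (a list: use GPU 3 only)
    """
    if "," in value:
        return [int(v) for v in value.split(",") if v.strip()]
    return int(value)

print(parse_gpus("3"))    # 3
print(parse_gpus("1,3"))  # [1, 3]
print(parse_gpus("3,"))   # [3]
```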
Example
Installation
$ pip install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install pytorch-lightning lightning-bolts guildai
File structure
test-guild
├── guild.yaml
└── test_guild
    ├── __init__.py
    └── main.py
main.py
import torch
from argparse import ArgumentParser

from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader

import pytorch_lightning as pl
from pl_bolts.datasets import DummyDataset


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def training_step(self, batch, batch_idx):
        # --------------------------
        # REPLACE WITH YOUR OWN
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss)
        return loss
        # --------------------------

    def validation_step(self, batch, batch_idx):
        # --------------------------
        # REPLACE WITH YOUR OWN
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss)
        # --------------------------

    def test_step(self, batch, batch_idx):
        # --------------------------
        # REPLACE WITH YOUR OWN
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('test_loss', loss)
        # --------------------------

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == "__main__":
    train = DummyDataset((1, 28, 28), (1,))
    train = DataLoader(train, batch_size=32)
    val = DummyDataset((1, 28, 28), (1,))
    val = DataLoader(val, batch_size=32)
    test = DummyDataset((1, 28, 28), (1,))
    test = DataLoader(test, batch_size=32)

    # init model
    ae = LitAutoEncoder()

    # Initialize a trainer from PL's own command-line arguments
    parser = ArgumentParser()
    parser = pl.Trainer.add_argparse_args(parser)
    args = parser.parse_args()
    trainer = pl.Trainer.from_argparse_args(args)

    # Train the model ⚡
    trainer.fit(ae, train, val)
guild.yaml
- model: test-guild
  sourcecode:
    - "*.py"
  operations:
    train:
      main: test_guild.main
      flags:
        max_steps:
          default: 2000
        gpus:
          default: "1,3"
Now, python test_guild/main.py --gpus "1,3" --max_steps=2000 and guild run train are identical.
I really like the experiment tracking capabilities of Guild AI; getting PyTorch Lightning and Guild AI working together is sometimes tough, but worth it.