I have run operations on remotes successfully in the past, though there was always some discrepancy between the imports for local and remote runs that I had to fix by trial and error.
In my current setup I switched from flags as global variables in the training script to flags in config.yml files, and I'm unable to make it work on remotes.
Project structure
Project:
- [some folders]
- datasets → module, contains the data loaders + their config.yml files
- zoo → Guild home for local runs
- models → model definitions
  - guild.yml
  - abstract_model.py
  - conv_lstm → the model I want to run
    - model.py → model definition
    - train.py → training script
    - config.yml → flags (sketched below)
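For reference, conv_lstm/config.yml is the file that flags-dest points at. Trimmed down, it looks roughly like this (values taken from the flag listing in the run output below; the exact defaults may differ):

optimizer: Adam
loss: mse
learning_rate: 0.001
epochs: 100
dev: false
gpus: [7]
dataset_args:
  - dataset_name: ucsd
    batch_size: 2
    train_path: ~/data/ucsd/UCSDped1/Train/
    test_path: ~/data/ucsd/UCSDped1/Test/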
Guild file
# Standard convolutional LSTM
- model: conv_lstm
  description: Convolutional LSTM
  operations:
    train_local:
      description: Train Convolutional LSTM
      sourcecode:
        - conv_lstm/train.py
        - conv_lstm/model.py
        - abstract_model.py
      requires:
        - config: conv_lstm/config.yml
        - file: ../datasets/
      main: conv_lstm/train
      flags-dest: config:conv_lstm/config.yml
      flags-import: all
      flags:
        epochs: 100
        dataset_args:
          - dataset_name: ucsd
            batch_size: 2
      output-scalars:
        train_loss: 'Train mse: (\value)'
        test_acc: 'Test mse: (\value)'
    train_remote:
      description: Train Convolutional LSTM on remote
      sourcecode:
        - conv_lstm/train.py
        - conv_lstm/model.py
        - abstract_model.py
      requires:
        - config: conv_lstm/config.yml
        - file: ../datasets/
      main: conv_lstm/train
      flags-dest: config:conv_lstm/config.yml
      flags-import: all
      flags:
        optimizer: Adam
        loss: mse
        learning_rate: 0.001
        epochs: 100
        dev: True
        gpus: [7]
        dataset_args:
          - dataset_name: ucsd
            batch_size: 2
            train_path: ~/data/ucsd/UCSDped1/Train/
            test_path: ~/data/ucsd/UCSDped1/Test/
      output-scalars:
        train_loss: 'Train mse: (\value)'
        test_acc: 'Test mse: (\value)'
Training script
import os
import sys

# Make the parent package and the datasets module importable
sys.path.append('../')
sys.path.append('../../datasets')

# Tensorflow logging level
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf
import yaml

from model import ConvLSTM
from datasets.data_loader import DataLoader

# Load the model configuration: merge the YAML mapping into the
# instance's __dict__ so flags read as plain attributes (config.epochs, ...)
class Config(object):
    def __init__(self, filename):
        self.__dict__.update(yaml.safe_load(open(filename)))

config = Config("config.yml")
(...)
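If I understand flags-dest correctly, Guild resolves the config dependency into the run directory and writes the flag values into that copy, which is why Config("config.yml") finds an up-to-date file locally. For train_local I'd expect the generated file to look something like this (my reading of the mechanism, not verified output):

epochs: 100
dataset_args:
  - dataset_name: ucsd
    batch_size: 2
# ...remaining keys keep their defaults from conv_lstm/config.yml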
Current situation & error
I'm able to run conv_lstm:train_local without any issues, and everything works as expected. However, an almost identical operation, with just a few flags changed, fails on the remote.
Issue 1: I can't see any evidence of the config.yml file being copied to the remote.
Issue 2: the remote run fails to find the main training script, even though the same main spec works locally.
guild -H /home/bleporowski/Projects/mad/zoo run conv_lstm:train_remote --remote [remote_name] --gpus 7
You are about to run conv_lstm:train_remote as a batch (1 trial) on [remote_name]
dataset_args: [{batch_size: 2, dataset_name: ucsd, test_path: ~/data/ucsd/UCSDped1/Test/, train_path: ~/data/ucsd/UCSDped1/Train/}]
dev: yes
epochs: 100
gpus: [7]
learning_rate: 0.001
loss: mse
optimizer: Adam
Continue? (Y/n) y
Building package
package src: /home/bleporowski/Projects/mad/models
package dist: /tmp/guild-remote-stage-eq7ahi7e
running clean
removing 'build/lib' (and everything under it)
removing 'build/bdist.linux-x86_64' (and everything under it)
'build/scripts-3.8' does not exist -- can't clean it
removing 'build'
running bdist_wheel
running build
running build_py
package init file '/home/bleporowski/Projects/mad/models/__init__.py' not found (or not a regular file)
creating build
creating build/lib
creating build/lib/conv_lstm
copying /home/bleporowski/Projects/mad/models/abstract_model.py -> build/lib/conv_lstm
copying /home/bleporowski/Projects/mad/models/guild.yml -> build/lib/conv_lstm
installing to build/bdist.linux-x86_64/wheel
running install
running install_lib
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/wheel
creating build/bdist.linux-x86_64/wheel/conv_lstm
copying build/lib/conv_lstm/guild.yml -> build/bdist.linux-x86_64/wheel/conv_lstm
copying build/lib/conv_lstm/abstract_model.py -> build/bdist.linux-x86_64/wheel/conv_lstm
running install_egg_info
running egg_info
writing conv_lstm.egg-info/PKG-INFO
writing dependency_links to conv_lstm.egg-info/dependency_links.txt
writing entry points to conv_lstm.egg-info/entry_points.txt
writing namespace_packages to conv_lstm.egg-info/namespace_packages.txt
writing top-level names to conv_lstm.egg-info/top_level.txt
reading manifest file 'conv_lstm.egg-info/SOURCES.txt'
writing manifest file 'conv_lstm.egg-info/SOURCES.txt'
Copying conv_lstm.egg-info to build/bdist.linux-x86_64/wheel/conv_lstm-0.0.0-py3.8.egg-info
running install_scripts
creating build/bdist.linux-x86_64/wheel/conv_lstm-0.0.0.dist-info/WHEEL
creating '/tmp/guild-remote-stage-eq7ahi7e/conv_lstm-0.0.0-py2.py3-none-any.whl' and adding 'build/bdist.linux-x86_64/wheel' to it
adding 'conv_lstm/abstract_model.py'
adding 'conv_lstm/guild.yml'
adding 'conv_lstm-0.0.0.dist-info/METADATA'
adding 'conv_lstm-0.0.0.dist-info/PACKAGE'
adding 'conv_lstm-0.0.0.dist-info/WHEEL'
adding 'conv_lstm-0.0.0.dist-info/entry_points.txt'
adding 'conv_lstm-0.0.0.dist-info/namespace_packages.txt'
adding 'conv_lstm-0.0.0.dist-info/top_level.txt'
adding 'conv_lstm-0.0.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
Initializing remote run
Copying package
sending incremental file list
conv_lstm-0.0.0-py2.py3-none-any.whl
sent 3,558 bytes received 35 bytes 1,437.20 bytes/sec
total size is 3,424 speedup is 0.95
Installing package and its dependencies
Processing ./conv_lstm-0.0.0-py2.py3-none-any.whl
Installing collected packages: conv-lstm
Successfully installed conv-lstm-0.0.0
Starting conv_lstm:train_remote on charybdis as 8a26ca399039412fb31c7791d293b507
WARNING: [Errno 2] No such file or directory: 'conv_lstm/config.yml'
WARNING: [Errno 2] No such file or directory: 'conv_lstm/config.yml'
WARNING: cannot import flags from conv_lstm/train: No module named conv_lstm/train
WARNING: cannot import flags from conv_lstm/train: No module named conv_lstm/train
INFO: [guild] Running trial 05afef0858f74b4198af160c6d904e2e: conv-lstm/conv_lstm:train_remote (dataset_args={batch_size: 2, dataset_name: ucsd, test_path: ~/data/ucsd/UCSDped1/Test/, train_path: ~/data/ucsd/UCSDped1/Train/}, dev=yes, epochs=100, gpus=7, learning_rate=0.001, loss=mse, optimizer=Adam)
INFO: [guild] Resolving config:conv_lstm/config.yml dependency
ERROR: [guild] Trial 05afef0858f74b4198af160c6d904e2e exited with an error: (1) run failed because a dependency was not met: could not resolve 'config:conv_lstm/config.yml' in config:conv_lstm/config.yml resource: cannot find source file 'conv_lstm/config.yml'
Run 8a26ca399039412fb31c7791d293b507 stopped with a status of 'completed'
Do remote runs require everything to become a proper package, with __init__.py files? Or should the guild file be in a different location?
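From the build log, only abstract_model.py and guild.yml make it into the wheel, and the missing __init__.py warning suggests nothing under conv_lstm/ is being collected. My current guess, based on my reading of the packaging docs (the data-files attribute is my assumption for shipping non-Python files), is that I need __init__.py files plus an explicit package section, something like:

- package: conv-lstm
  data-files:
    - conv_lstm/config.yml

Is that the intended way to get config.yml and the training script onto the remote, or is the guild file simply in the wrong place?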