Local run vs remote run dependencies

I have run some operations on remotes successfully in the past. However, there was always some discrepancy between the imports for local and remote runs that I needed to fix by trial and error.

In my current setup, I switched from defining flags as global variables in the training script to defining them in per-model config.yml files, and I'm unable to make this work on remotes.

Project structure
Project:

  • [some folders]
  • datasets → module, contains data loaders + their config.yml files
  • zoo → Guild home for local runs
  • models → model definitions
    • guild.yml
    • abstract_model.py
    • conv_lstm → model I want to run
      • model.py → model definition
      • train.py → training script
      • config.yml → flags
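
For context, conv_lstm/config.yml holds the flag values the training script reads. It looks roughly like this (the keys mirror the flags in the Guild file below; the values here are illustrative, not my exact settings):

optimizer: Adam
loss: mse
learning_rate: 0.001
epochs: 100
dev: True
gpus: [0]
dataset_args:
  - dataset_name: ucsd
    batch_size: 2
    train_path: ./data/ucsd/UCSDped1/Train/
    test_path: ./data/ucsd/UCSDped1/Test/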

Guild file

# Standard convolutional LSTM
- model: conv_lstm
  description: Convolutional LSTM
  operations:
    train_local:
      description: Train Convolutional LSTM
      sourcecode:
        - conv_lstm/train.py
        - conv_lstm/model.py
        - abstract_model.py
      requires:
        - config: conv_lstm/config.yml
        - file: ../datasets/
      main: conv_lstm/train
      flags-dest: config:conv_lstm/config.yml
      flags-import: all
      flags:
        epochs: 100
        dataset_args:
          - dataset_name: ucsd
            batch_size: 2
      output-scalars:
        train_loss: 'Train mse: (\value)'
        test_acc: 'Test mse: (\value)'
    train_remote:
      description: Train Convolutional LSTM on remote
      sourcecode:
        - conv_lstm/train.py
        - conv_lstm/model.py
        - abstract_model.py
      requires:
        - config: conv_lstm/config.yml
        - file: ../datasets/
      main: conv_lstm/train
      flags-dest: config:conv_lstm/config.yml
      flags-import: all
      flags:
        optimizer: Adam
        loss: mse
        learning_rate: 0.001
        epochs: 100
        dev: True
        gpus: [7]
        dataset_args:
          - dataset_name: ucsd
            batch_size: 2
            train_path: ~/data/ucsd/UCSDped1/Train/
            test_path: ~/data/ucsd/UCSDped1/Test/
      output-scalars:
        train_loss: 'Train mse: (\value)'
        test_acc: 'Test mse: (\value)'

Training script

import os
import sys

sys.path.append('../')
sys.path.append('../../datasets')

# Reduce TensorFlow logging verbosity
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf
import yaml
from model import ConvLSTM
from datasets.data_loader import DataLoader


# Load the model configuration
class Config(object):
    def __init__(self, filename):
        with open(filename) as f:
            self.__dict__.update(yaml.safe_load(f))


config = Config("config.yml")

(...)

Current situation & error
I’m able to run ‘conv_lstm:train_local’ without any issues, and everything works as expected. However, almost the same configuration, with only a few flags changed, fails to run on the remote.

Issue 1: I cannot see any evidence of the config.yml file being copied to the remote.
Issue 2: The remote run fails to find the main training script, even though everything works locally.

guild -H /home/bleporowski/Projects/mad/zoo run conv_lstm:train_remote --remote [remote_name] --gpus 7
You are about to run conv_lstm:train_remote as a batch (1 trial) on [remote_name]
  dataset_args: [{batch_size: 2, dataset_name: ucsd, test_path: ~/data/ucsd/UCSDped1/Test/, train_path: ~/data/ucsd/UCSDped1/Train/}]
  dev: yes
  epochs: 100
  gpus: [7]
  learning_rate: 0.001
  loss: mse
  optimizer: Adam
Continue? (Y/n) y
Building package
package src: /home/bleporowski/Projects/mad/models
package dist: /tmp/guild-remote-stage-eq7ahi7e
running clean
removing 'build/lib' (and everything under it)
removing 'build/bdist.linux-x86_64' (and everything under it)
'build/scripts-3.8' does not exist -- can't clean it
removing 'build'
running bdist_wheel
running build
running build_py
package init file '/home/bleporowski/Projects/mad/models/__init__.py' not found (or not a regular file)
creating build
creating build/lib
creating build/lib/conv_lstm
copying /home/bleporowski/Projects/mad/models/abstract_model.py -> build/lib/conv_lstm
copying /home/bleporowski/Projects/mad/models/guild.yml -> build/lib/conv_lstm
installing to build/bdist.linux-x86_64/wheel
running install
running install_lib
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/wheel
creating build/bdist.linux-x86_64/wheel/conv_lstm
copying build/lib/conv_lstm/guild.yml -> build/bdist.linux-x86_64/wheel/conv_lstm
copying build/lib/conv_lstm/abstract_model.py -> build/bdist.linux-x86_64/wheel/conv_lstm
running install_egg_info
running egg_info
writing conv_lstm.egg-info/PKG-INFO
writing dependency_links to conv_lstm.egg-info/dependency_links.txt
writing entry points to conv_lstm.egg-info/entry_points.txt
writing namespace_packages to conv_lstm.egg-info/namespace_packages.txt
writing top-level names to conv_lstm.egg-info/top_level.txt
reading manifest file 'conv_lstm.egg-info/SOURCES.txt'
writing manifest file 'conv_lstm.egg-info/SOURCES.txt'
Copying conv_lstm.egg-info to build/bdist.linux-x86_64/wheel/conv_lstm-0.0.0-py3.8.egg-info
running install_scripts
creating build/bdist.linux-x86_64/wheel/conv_lstm-0.0.0.dist-info/WHEEL
creating '/tmp/guild-remote-stage-eq7ahi7e/conv_lstm-0.0.0-py2.py3-none-any.whl' and adding 'build/bdist.linux-x86_64/wheel' to it
adding 'conv_lstm/abstract_model.py'
adding 'conv_lstm/guild.yml'
adding 'conv_lstm-0.0.0.dist-info/METADATA'
adding 'conv_lstm-0.0.0.dist-info/PACKAGE'
adding 'conv_lstm-0.0.0.dist-info/WHEEL'
adding 'conv_lstm-0.0.0.dist-info/entry_points.txt'
adding 'conv_lstm-0.0.0.dist-info/namespace_packages.txt'
adding 'conv_lstm-0.0.0.dist-info/top_level.txt'
adding 'conv_lstm-0.0.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
Initializing remote run
Copying package
sending incremental file list
conv_lstm-0.0.0-py2.py3-none-any.whl

sent 3,558 bytes  received 35 bytes  1,437.20 bytes/sec
total size is 3,424  speedup is 0.95
Installing package and its dependencies
Processing ./conv_lstm-0.0.0-py2.py3-none-any.whl
Installing collected packages: conv-lstm
Successfully installed conv-lstm-0.0.0
Starting conv_lstm:train_remote on charybdis as 8a26ca399039412fb31c7791d293b507
WARNING: [Errno 2] No such file or directory: 'conv_lstm/config.yml'
WARNING: [Errno 2] No such file or directory: 'conv_lstm/config.yml'
WARNING: cannot import flags from conv_lstm/train: No module named conv_lstm/train
WARNING: cannot import flags from conv_lstm/train: No module named conv_lstm/train
INFO: [guild] Running trial 05afef0858f74b4198af160c6d904e2e: conv-lstm/conv_lstm:train_remote (dataset_args={batch_size: 2, dataset_name: ucsd, test_path: ~/data/ucsd/UCSDped1/Test/, train_path: ~/data/ucsd/UCSDped1/Train/}, dev=yes, epochs=100, gpus=7, learning_rate=0.001, loss=mse, optimizer=Adam)
INFO: [guild] Resolving config:conv_lstm/config.yml dependency
ERROR: [guild] Trial 05afef0858f74b4198af160c6d904e2e exited with an error: (1) run failed because a dependency was not met: could not resolve 'config:conv_lstm/config.yml' in config:conv_lstm/config.yml resource: cannot find source file 'conv_lstm/config.yml'
Run 8a26ca399039412fb31c7791d293b507 stopped with a status of 'completed'

Do remote runs require everything to become a module, with an __init__.py? Or should the Guild file be in a different location?

Sorry for the late reply here! I thought someone had replied to this but I had my messages crossed. Taking a look now.

Hi @CptPirx,

We think your issue is the lack of a data-files entry in your package definition (see the Packages docs).
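
Something like this in your guild.yml should get those files into the built package (the package name and file list here are just a sketch based on your layout, so adjust as needed):

- package: conv-lstm
  data-files:
    - conv_lstm/config.yml
    - conv_lstm/train.py
    - conv_lstm/model.py
    - abstract_model.py

With a data-files entry along those lines, the wheel that gets copied to the remote should contain config.yml alongside the model sources.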

Garrett and I discussed the unintuitive behavior where explicitly listed sourcecode and requires files are not included in the package. We'll file a feature request to straighten this out, so that a data-files entry is only necessary for files not already listed in other fields.

EDIT: issue filed at Package explicitly listed sourcecode and config entries without data-files · Issue #323 · guildai/guildai · GitHub

After adding the appropriate data-files entries to the package, as suggested, everything works.

Thank you both for the quick help, as always :slight_smile:


I’ll echo @msarahan with a mea culpa — Guild should figure that stuff out and not require the extra work. We’ll get that cleaned up! Thanks for your patience.