Guild run can't find module/relative import

Hello,

I have my project structured as follows:

classification/
├── README.md
├── __init__.py
├── coordinate_conv.py
├── cosine_annealing.py
├── data_generator.py
├── deformable_conv.py
├── drop_block.py
├── nnet_blocks.py
├── infer.py
├── model.py
├── train.py
└── utils.py

Where every python script is a module, and arguments are passed with argparse. So I would run train.py as

python -m classification.train \
    --model-name model \
    --train-data path/to/train/data \
    --cycles 3 \
    --no-require-clean

  1. If I run guild run outside the classification folder, having configured guild.yml according to the documentation (tried both: main: classification.train and main: classification/train), guild tells me it cannot find the classification.train module.

  2. If I run guild run train.py from inside the classification folder, guild tells me it cannot do relative imports with no known parent package (expected, I guess).
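
For reference, the second error can be reproduced without Guild. The sketch below builds a throwaway package (the file contents are made up for illustration) and runs the same train.py twice: once as a plain script from inside the package folder, and once as a module from the parent folder. Only the script form should fail on the relative import.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run the same file as a script and as a module; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "classification"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "utils.py").write_text("GREETING = 'hello'\n")
        # train.py uses a relative import, like the modules in the project.
        (pkg / "train.py").write_text(
            "from .utils import GREETING\nprint(GREETING)\n"
        )

        # 1) Plain script, started from inside the package folder.
        as_script = subprocess.run(
            [sys.executable, "train.py"],
            cwd=pkg, capture_output=True, text=True,
        )
        # 2) Module, started from the folder that contains the package.
        as_module = subprocess.run(
            [sys.executable, "-m", "classification.train"],
            cwd=tmp, capture_output=True, text=True,
        )
        return as_script, as_module


as_script, as_module = run_both_ways()
print("script exit code:", as_script.returncode)
print("module exit code:", as_module.returncode)
```

The module form works only because the folder containing classification/ is the working directory, which is why it matters where Guild starts the process and which files it has copied there.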

So how would I run the training script with the above project structure, argparse, and relative imports?

Thanks!
-fernando

Hey, could you also provide the guild.yml file and the error messages you get?

Maybe somebody more knowledgeable will have a better answer; here is my suggestion though:

I’d start simple and build up from that => start with just the two following files and nothing more:
guild.yml:

train:
  main: classification.train

classification/train.py:

print("hello world")

Now guild ops should show you the train operation, and guild run train should succeed.
And if this works then you can work towards finding the issue you have :slight_smile:

Thanks @MatejNikl , both the guild.yml and error message are as I described above:

> cat guild.yml
train:
  main: classification.train
> guild run train
You are about to run train
Continue? (Y/n)
WARNING: Skipping potential source code file /Users/fspaolo/dev/sentinel-1-ee/classification/detections_test.csv because it's too big. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
WARNING: Found more than 100 source code files but will only copy 100 as a safety measure. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
guild: No module named classification.train

Maybe due to the 100 sourcecode limit my modules are not being copied (which wouldn’t make sense, as the module is being specified in guild.yml… it should be the priority code to copy)?

You have to specify the sourcecode.

train:
  sourcecode:
    - '*.py'

I did figure that out actually (which seems counterintuitive, as I am telling guild the path to my python code/module… so all python scripts in there should be “sourcecode” by default, no?).

Then I ran into another problem. My code is calling a git command at execution. But guild is running the code from another (temporary) folder, which is not a git repository… so I get an error: “the current directory is not a git repo”… I then changed my code and disabled the git command… and got another error: PathNotFoundError: nothing found at path ''. Is this because the path to the data is relative and the code is being run from a different directory? If so, many things will obviously break. Do I have to figure out what breaks myself and come up with a workaround?

I am not doing anything out of the ordinary here. My project structure is standard, using modules, relative imports, argparse and git. So how do I run this

python -m classification.train \
    --model-name model \
    --train-data path/to/train/data \
    --cycles 3 \
    --no-require-clean

with guild (without significant code changes)?

Thanks.

You have to specify your data as a guild resource.

What is your use case for your git command?

EDIT: To debug your guild run folder try ‘guild ls’ to show what is being placed in the runs folder.

From Guild Docs – Runs:

Runs serve as a unit of reproducibility.

For that reason each time you do guild run it copies your code under a run directory made specifically for that run before guild runs the code.
Guild does this, for example, so that you have a snapshot of the ad hoc code that produced some nice results, and you can actually go back to this code and see what you did there.

This run directory becomes the working directory your script will resolve all relative paths to. And since you haven’t told guild to also copy there the train data and the path you specified is relative, the train data are simply not there.
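
A small sketch of that effect, independent of Guild: the same relative path is checked from two working directories. The directory names project and run are placeholders made up for this example, standing in for the project folder and a fresh run directory.

```python
import os
import tempfile
from pathlib import Path


def exists_from(cwd, relative_path):
    """Return True if relative_path exists when resolved from cwd."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        return Path(relative_path).exists()
    finally:
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"   # where the data actually lives
    run_dir = Path(tmp) / "run"       # empty directory, like a new run
    (project / "data").mkdir(parents=True)
    (project / "data" / "file.ext").write_text("payload")
    run_dir.mkdir()

    found_in_project = exists_from(project, "data/file.ext")
    found_in_run_dir = exists_from(run_dir, "data/file.ext")

print("found from project dir:", found_in_project)
print("found from run dir:", found_in_run_dir)
```

Nothing about the path string changes between the two checks; only the directory it is resolved against does.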

You can specify the train data as an absolute path just to check whether that works. But I would suggest including your train data in the guild ecosystem as well, at least as a project file dependency, or probably rather as a run file dependency, since your train data are perhaps the result of some prepare-train-data operation, aren’t they?

Maybe you could go through the dependencies example to get a better understanding of how they work.

@copah @MatejNikl Some comments:

Use case for Git:

  • Before training it forces the user to commit the current state of the repository
  • At run time the code logs the commit state of the repo for that specific run
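
For context, the logging step could be written so that it degrades gracefully when the working directory is not a Git repository, which is the situation inside a run directory. This is only a sketch: current_commit is a hypothetical helper, not the project's actual code, and it does require a small code change.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the HEAD commit hash for repo_dir, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a Git repository, no commits yet, a missing directory,
        # or git itself is not installed.
        return None
    return result.stdout.strip()


commit = current_commit()
print("commit:", commit if commit else "unknown (not a git repo)")
```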

Changing to absolute paths seems to work. Now, this is not the best idea as the whole package/project is self contained with relative paths. We run this project from different virtual machines on different user environments. So paths like /User/username/project will break.

Copying the data to the guild run directory is prohibitive (and bad practice). Our data is huge, data should not be moved around, and we load data from both local machine and cloud storage depending on the situation. The cloud data takes days just to generate.

The whole guild advertisement is that you don’t need to modify your code/strategy, isn’t it? We implement all Python/ML best practices in our projects. And we do large-scale ML experiments (with several petabytes of satellite data). So the question is: can Guild handle this scenario only with the addition of a configuration file? If so, is there a full example somewhere?

Thanks,
-fernando

Some quick comments:

Guild logs the git sha for you automatically.

Guild resources can be symbolic links, so relative paths still work. Look up the resources docs.

Having a dirty git repo is not an issue under guild since guild copies the entire source code for each run. There is a command called ‘guild diff RUN_ID --working’ where you can see the difference between the run and working directory source code.

@garrett wrote a great comment on why having to commit your code for each experiment is a huge pain. I’ll see if I can find it later. Guild doesn’t enforce this, which makes experimenting faster and easier.

I’ll see if I can write a full example later when I have more time.

Thanks @copah and @MatejNikl for your excellent answers!

@fspaolo you’re running into some common issues that are frustrating but there are some good reasons for them. That’s an annoying answer I know but bear with me…

ML projects almost always run operations within a source directory. Results are either logged to some project-local directory that’s included in .gitignore or logged to a temporary location. E.g. a lot of the examples you run into from Google log under /tmp.

Guild goes out of its way to prevent this. Guild runs from an empty directory, which is the run directory. Each run gets its own directory. Guild copies source code to that directory along with any other files or directories that you need. Guild does its best to guess the source code. It does not even attempt to guess other files/directories.

You’ve already seen how to control the source code spec. Guild could certainly spend a bit more time figuring out exactly what your source code is but it could still miss important source related files. So it does a lightweight test and includes safeguards to prevent copying very large or very many files. If this lightweight test doesn’t work, you need to specify the config, as you’ve done.

Regarding data, you want this:

train:
  requires:
    - file: data
      target-type: link

Guild links by default to directories but this will change in the future, so it’s best to explicitly configure it that way.
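
To illustrate why a link keeps relative paths working without copying any data, here is a rough sketch that does not call Guild at all. os.symlink stands in for what a linked file resource is assumed to do, and the project and run directory names are placeholders for this example.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    run_dir = Path(tmp) / "run"
    (project / "data").mkdir(parents=True)
    (project / "data" / "file.ext").write_text("payload")
    run_dir.mkdir()

    # Stand-in for a linked resource: the run directory gets a symbolic
    # link named "data" that points back at the project's data directory.
    os.symlink(project / "data", run_dir / "data", target_is_directory=True)

    previous = os.getcwd()
    os.chdir(run_dir)
    try:
        # The script's relative path now resolves through the link.
        content = Path("data/file.ext").read_text()
        is_link = Path("data").is_symlink()
    finally:
        os.chdir(previous)

print("read through link:", content)
print("data is a symlink:", is_link)
```

The data stays where it is; only a link is created in the run directory, so large datasets are not moved or duplicated.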

As for the Git repo status, as was mentioned, Guild does record the git commit for you. If there are changed files, Guild indicates the latest commit with an asterisk. Importantly, as was also mentioned, Guild copies the source code and runs the operations using that code and not the code in your project. This is an important feature as it lets you freely modify your project source code when operations are still running without impacting those operations. It also lets you stage several operations, each with their own source code changes.

If this approach doesn’t work, please let us know and we can consider other approaches.

Thanks @garrett @copah @MatejNikl for elaborating.

I’m still unable to obtain a working example with minimal code modification. For example:

This doesn’t work

flags-dest: args
flags:
  train-data: data/file.ext

but this does

flags-dest: args
flags:
  train-data: /Users/username/dev/project/package/data/file.ext

Paths cannot be relative within a Python package then? (I’ve tried linking to both the full and relative data paths)

Another problem is that the code generates a few directories to save some stuff (e.g. model weights, log files). Here again I have a problem with the paths:

What should be

untracked/runs/20210804T185741/wts/weights.wts

becomes

/Users/username/anaconda3/envs/envname/.guild/runs/43bba88802c241ed81bc70bc7d95cf52/.guild/sourcecode/untracked/runs/20210804T185741/wts/weights.wts

How can I keep my original path?

Finally, is there a way to capture a specific variable or a print statement within the code to be displayed in the terminal, say with guild compare? I would like to quickly search for a specific run, hit Enter and have this information displayed right in the terminal?

A use case would be to get the full path to the weights, model or config file so I can copy-and-paste it as argument to my inference code (without having to search and open any files).

Thanks!
-fernando

@fspaolo

You shouldn’t use the flags attribute for specifying data resources.

Here’s an example on how to use the resources attribute to specify data dependencies via symbolic links:

- model: my_model_name
  operations:
    train:
      flags:
        model_name:
          default: "default_model_name"
      requires:
        - prepared_data
  resources:
    prepared_data:
      - file: data/my_data_folder
        target-type: link
        target-path: data/

guild will link to the data/my_data_folder in the new guild run folder and it will all work with relative paths.

@copah

I’m confused now. The path to the data and indices files are arguments to the training code (parsed through argparse):

usage: train.py [-h] --model-name MODEL_NAME --train-data TRAIN_DATA
                     --train-index TRAIN_INDEX [--cycles CYCLES]
                     [--cfg-dir CFG_DIR] [--log-level LOG_LEVEL]
                     [--no-require-clean]

train.py: error: the following arguments are required: --train-data, --train-index

According to the documentation I should parse these command-line arguments as

flags-dest: args
flags:
  arg1: myarg1
  arg2: myarg2

…or not?
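
For reference, a parser matching the usage message above might look like the sketch below. It is a hypothetical reconstruction: the defaults and types are assumptions, and the argument list is only my understanding of what flags-dest: args produces (each flag passed as --name value), not something confirmed in this thread.

```python
import argparse


def build_parser():
    """Hypothetical parser reconstructed from the usage message above."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--train-data", required=True)
    parser.add_argument("--train-index", required=True)
    parser.add_argument("--cycles", type=int, default=1)   # assumed default
    parser.add_argument("--cfg-dir", default=None)         # assumed default
    parser.add_argument("--log-level", default="INFO")     # assumed default
    parser.add_argument("--no-require-clean", action="store_true")
    return parser


# Assumed shape of the command line built from the flags.
# The paths are placeholders, still relative to the working directory.
argv = [
    "--model-name", "model",
    "--train-data", "data/file.ext",
    "--train-index", "data/index.ext",
    "--cycles", "3",
]
args = build_parser().parse_args(argv)
print(args.model_name, args.train_data, args.train_index, args.cycles)
```

The flags only carry the path strings to the script. Whether those paths exist in the run directory is a separate question, which is what the requires/resources section is for.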

And what’s - prepared_data? I can’t get your example to work (it keeps showing the paths as empty strings "").

This is my full guild.yml

- model: classification
  operations:
    train:
      main: classification.train
      flags-dest: args
      flags:
        model-name: model
        train-data: data/detection_tiles_v1.zarr
        train-index: data/detection_tiles_v1_index.zarr/train/0
        cycles: 3
        no-require-clean: null
      requires:
        - prepared_data
      sourcecode:
        - '*.py'
  resources:
    prepared_data:
      - file: data/detection_tiles_v1.zarr
      - file: data/detection_tiles_v1_index.zarr
        target-type: link
        target-path: data/

which shows

> guild run train
You are about to run classification:train
  cycles: 3
  model-name: model
  no-require-clean: yes
  train-data: data/detection_tiles_v1.zarr
  train-index: data/detection_tiles_v1_index.zarr/train/0
Continue? (Y/n)

but raises

zarr.errors.PathNotFoundError: nothing found at path ''

It’s difficult to debug without the code, but you might have to specify the target-path for the data/detection_tiles_v1.zarr as well:

- model: classification
  operations:
    train:
      main: classification.train
      flags-dest: args
      flags:
        model-name: model
        train-data: data/detection_tiles_v1.zarr
        train-index: data/detection_tiles_v1_index.zarr/train/0
        cycles: 3
        no-require-clean: null
      requires:
        - prepared_data
      sourcecode:
        - '*.py'
  resources:
    prepared_data:
      - file: data/detection_tiles_v1.zarr
        target-type: link
        target-path: data/ 
      - file: data/detection_tiles_v1_index.zarr
        target-type: link
        target-path: data/ 

To debug file paths etc., try running guild ls RUN_ID; then you can see what the folder structure looks like inside the run directory.

@fspaolo

Did it work for you?