Guild run can't find module/relative import

Hello,

I have my project structured as follows:

classification/
├── README.md
├── __init__.py
├── coordinate_conv.py
├── cosine_annealing.py
├── data_generator.py
├── deformable_conv.py
├── drop_block.py
├── nnet_blocks.py
├── infer.py
├── model.py
├── train.py
└── utils.py

Every Python script is a module, and arguments are passed with argparse, so I would run train.py as:

python -m classification.train \
    --model-name model \
    --train-data path/to/train/data \
    --cycles 3 \
    --no-require-clean

  1. If I run guild run outside the classification folder, having configured guild.yml according to the documentation (I tried both main: classification.train and main: classification/train), Guild tells me it cannot find the classification.train module.

  2. If I run guild run train.py from inside the classification folder, Guild tells me it cannot do relative imports with no known parent package (expected, I guess).

So how would I run the training script with the above project structure, argparse, and relative imports?

Thanks!
-fernando

Hey, could you also provide the guild.yml file and the error messages you get?

Maybe somebody more knowledgeable will have a better answer; here is my suggestion though:

I’d start simple and build up from there: start with just the two following files and nothing more:
guild.yml:

train:
  main: classification.train

classification/train.py:

print("hello world")

Now guild ops should show you the train operation and guild run train should succeed.
And if this works then you can work towards finding the issue you have :slight_smile:

Thanks @MatejNikl, both the guild.yml and the error message are as I described above:

> cat guild.yml
train:
  main: classification.train
> guild run train
You are about to run train
Continue? (Y/n)
WARNING: Skipping potential source code file /Users/fspaolo/dev/sentinel-1-ee/classification/detections_test.csv because it's too big. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
WARNING: Found more than 100 source code files but will only copy 100 as a safety measure. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
guild: No module named classification.train

Maybe due to the 100-file sourcecode limit my modules are not being copied (which wouldn’t make sense, as the module is specified in guild.yml… it should be the priority code to copy)?

You have to specify the sourcecode.

train:
  sourcecode:
    - '*.py'
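
For completeness, combining this with the main spec from earlier, the whole operation would look something like this (a minimal sketch; adjust the pattern to match your layout):

train:
  main: classification.train
  sourcecode:
    - '*.py'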

I did figure that out actually (which seems counterintuitive, as I am telling Guild the path to my Python code/module… so all Python scripts in there should be “sourcecode” by default, no?).

Then I ran into another problem. My code calls a git command at execution time, but Guild runs the code from another (temporary) folder, which is not a git repository… so I get an error: “the current directory is not a git repo”. I then change my code and disable the git command… and get another error: PathNotFoundError: nothing found at path ''. Is this because the path to the data is relative and the code is being run from a different directory? If so, many things will obviously break. Do I have to figure out what breaks myself and come up with a workaround?

I am not doing anything out of the ordinary here. My project structure is standard, using modules, relative imports, argparse and git. So how do I run this

python -m classification.train \
    --model-name model \
    --train-data path/to/train/data \
    --cycles 3 \
    --no-require-clean

with guild (without significant code changes)?

Thanks.

You have to specify your data as a guild resource.

What is your use case for your git command?

EDIT: To debug your Guild run folder, try guild ls to see what is being placed in the run directory.

From Guild Docs – Runs:

Runs serve as a unit of reproducibility.

For that reason, each time you do guild run, Guild copies your code into a run directory made specifically for that run before running the code.
Guild does this, for example, so that you have a snapshot of the ad-hoc code that produced some nice results, and you can actually go back to that code and see what you did there.

This run directory becomes the working directory against which your script resolves all relative paths. And since you haven’t told Guild to also copy the train data there, and the path you specified is relative, the train data simply aren’t there.

You can specify the train data as an absolute path just to check whether that works, but I would suggest including your train data in the Guild ecosystem as well: at least as a project file dependency, or probably rather as a run file dependency, since your train data are presumably the result of some prepare-train-data operation, aren’t they?

Maybe you could go through the dependencies example to get a better understanding of how they work.
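
In the meantime, a minimal project file dependency might look something like this (just a sketch; path/to/train/data is a placeholder for wherever your train data actually live):

train:
  main: classification.train
  requires:
    - file: path/to/train/data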

@copah @MatejNikl Some comments:

Use case for Git:

  • Before training it forces the user to commit the current state of the repository
  • At run time the code logs the commit state of the repo for that specific run

Changing to absolute paths seems to work. However, this is not the best idea, as the whole package/project is self-contained with relative paths. We run this project from different virtual machines in different user environments, so paths like /User/username/project will break.

Copying the data into the Guild run directory is prohibitive (and bad practice). Our data is huge, data should not be moved around, and we load data from both the local machine and cloud storage depending on the situation. The cloud data takes days just to generate.

The whole Guild pitch is that you don’t need to modify your code/strategy, isn’t it? We implement all Python/ML best practices in our projects, and we run large-scale ML experiments (with several petabytes of satellite data). So the question is: can Guild handle this scenario with only the addition of a configuration file? If so, is there a full example somewhere?

Thanks,
-fernando

Some quick comments:

Guild logs the git sha for you automatically.

Guild resources can be symbolic links, so relative paths still work. Look up the resources docs.

Having a dirty git repo is not an issue under Guild, since Guild copies the entire source code for each run. There is a command, guild diff RUN_ID --working, that shows the difference between the run’s source code and your working directory.

@garrett wrote a great comment on why having to commit your code for each experiment is a huge pain. I’ll see if I can find it later. Guild doesn’t enforce this, which makes experimenting faster and easier.

I’ll see if I can write a full example later when I have more time.

Thanks @copah and @MatejNikl for your excellent answers!

@fspaolo you’re running into some common issues that are frustrating, but there are good reasons for them. That’s an annoying answer, I know, but bear with me…

ML projects almost always run operations within a source directory. Results are either logged to some project-local directory that’s included in .gitignore or logged to a temporary location. E.g. a lot of the examples you run into from Google log under /tmp.

Guild goes out of its way to prevent this. Guild runs from an empty directory, which is the run directory. Each run gets its own directory. Guild copies source code to that directory along with any other files or directories that you need. Guild does its best to guess the source code. It does not even attempt to guess other files/directories.

You’ve already seen how to control the source code spec. Guild could certainly spend a bit more time figuring out exactly what your source code is, but it could still miss important source-related files. So it does a lightweight test and includes safeguards to prevent copying very large or very many files. If this lightweight test doesn’t work, you need to specify the config, as you’ve done.
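
For example, the sourcecode spec also accepts exclude rules, so you could skip that large CSV explicitly (roughly like this; see the source code configuration docs for the full select-rule syntax):

train:
  sourcecode:
    - exclude: '*.csv'
    - '*.py'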

Regarding data, you want this:

train:
  requires:
    - file: data
      target-type: link

Guild links to directories by default, but this will change in the future, so it’s best to explicitly configure it that way.
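
Putting the pieces from this thread together, the operation might look roughly like this (a sketch that assumes your training data live in a local data/ directory next to the code):

train:
  main: classification.train
  sourcecode:
    - '*.py'
  requires:
    - file: data
      target-type: link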

As for the Git repo status, as was mentioned, Guild does record the git commit for you. If there are changed files, Guild indicates the latest commit with an asterisk. Importantly, as was also mentioned, Guild copies the source code and runs the operations using that code and not the code in your project. This is an important feature as it lets you freely modify your project source code when operations are still running without impacting those operations. It also lets you stage several operations, each with their own source code changes.

If this approach doesn’t work, please let us know and we can consider other approaches.

Thanks @garrett @copah @MatejNikl for elaborating.

I’m still unable to obtain a working example with minimal code modification. For example:

This doesn’t work

flags-dest: args
flags:
  train-data: data/file.ext

but this does

flags-dest: args
flags:
  train-data: /Users/username/dev/project/package/data/file.ext

So paths cannot be relative within a Python package? (I’ve tried linking to both the full and relative data paths.)

Another problem is that the code generates a few directories to save some stuff (e.g. model weights, log files). Here again I have a problem with the paths:

What should be

untracked/runs/20210804T185741/wts/weights.wts

becomes

/Users/username/anaconda3/envs/envname/.guild/runs/43bba88802c241ed81bc70bc7d95cf52/.guild/sourcecode/untracked/runs/20210804T185741/wts/weights.wts

How can I keep my original path?

Finally, is there a way to capture a specific variable or a print statement within the code and display it in the terminal, say with guild compare? I would like to quickly search for a specific run, hit Enter, and have this information displayed right in the terminal.

A use case would be to get the full path to the weights, model, or config file so I can copy and paste it as an argument to my inference code (without having to search for and open any files).

Thanks!
-fernando

@fspaolo

You shouldn’t use the flags attribute for specifying data resources.

Here’s an example of how to use the resources attribute to specify data dependencies via symbolic links:

- model: my_model_name
  operations:
    train:
      flags:
        model_name:
          default: "default_model_name"
      requires:
        - prepared_data
  resources:
    prepared_data:
      - file: data/my_data_folder
        target-type: link
        target-path: data/

Guild will link data/my_data_folder into the new run folder, and it will all work with relative paths.
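
So, for example, a relative flag value can point straight at the linked location; here is a rough sketch reusing the same (hypothetical) names as above:

- model: my_model_name
  operations:
    train:
      flags-dest: args
      flags:
        train-data: data/my_data_folder
      requires:
        - prepared_data
  resources:
    prepared_data:
      - file: data/my_data_folder
        target-type: link
        target-path: data/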

@copah

I’m confused now. The paths to the data and index files are arguments to the training code (parsed with argparse):

usage: train.py [-h] --model-name MODEL_NAME --train-data TRAIN_DATA
                     --train-index TRAIN_INDEX [--cycles CYCLES]
                     [--cfg-dir CFG_DIR] [--log-level LOG_LEVEL]
                     [--no-require-clean]

train.py: error: the following arguments are required: --train-data, --train-index

According to the documentation I should pass these command-line arguments as

flags-dest: args
flags:
  arg1: myarg1
  arg2: myarg2

…or not?

And what’s - prepared_data? I can’t get your example to work (it keeps showing the paths as empty strings "").

This is my full guild.yml

- model: classification
  operations:
    train:
      main: classification.train
      flags-dest: args
      flags:
        model-name: model
        train-data: data/detection_tiles_v1.zarr
        train-index: data/detection_tiles_v1_index.zarr/train/0
        cycles: 3
        no-require-clean: null
      requires:
        - prepared_data
      sourcecode:
        - '*.py'
  resources:
    prepared_data:
      - file: data/detection_tiles_v1.zarr
      - file: data/detection_tiles_v1_index.zarr
        target-type: link
        target-path: data/

which shows

> guild run train
You are about to run classification:train
  cycles: 3
  model-name: model
  no-require-clean: yes
  train-data: data/detection_tiles_v1.zarr
  train-index: data/detection_tiles_v1_index.zarr/train/0
Continue? (Y/n)

but raises

zarr.errors.PathNotFoundError: nothing found at path ''

It’s difficult to debug without the code, but you might have to specify the target-path for the data/detection_tiles_v1.zarr as well:

- model: classification
  operations:
    train:
      main: classification.train
      flags-dest: args
      flags:
        model-name: model
        train-data: data/detection_tiles_v1.zarr
        train-index: data/detection_tiles_v1_index.zarr/train/0
        cycles: 3
        no-require-clean: null
      requires:
        - prepared_data
      sourcecode:
        - '*.py'
  resources:
    prepared_data:
      - file: data/detection_tiles_v1.zarr
        target-type: link
        target-path: data/ 
      - file: data/detection_tiles_v1_index.zarr
        target-type: link
        target-path: data/ 

To debug file paths etc., try running guild ls RUN_ID; then you can see what the folder structure looks like inside the run folder.

@fspaolo

Did it work for you?

@copah

Sorry for the late reply (got swamped with work)! Yes, it works so far. We are currently evaluating Guild with our ML pipelines (in order to find the “right” tool for our organization’s needs). One of the challenges we have is managing/tracking our complex data flow. We deal with large amounts of global satellite data from multiple sources, and need to be able to run (and track) the same ML code on both local and virtual machines (in the cloud). Any specific doc or discussion I could read on managing/tracking multiple runs (with the same code and data source) on different machines?

Thank you!

@fspaolo

Have a look at this discussion: Data versioning

We use a combination of DvC and Guild, but will migrate to Guild completely when support for DvC-tracked files arrives.

Hi @fspaolo, Guild’s current support for this is similar in concept to the way git is designed: you push and pull data/content (in Guild’s case, runs) to and from machines based on your workflow. Like git, Guild doesn’t have a central point of control. Instead you orchestrate your workflows from various Guild client installations.

Unfortunately, workflows don’t come “out of the box” with Guild. You need to do some wiring based on your organizational requirements. A common workflow is to pull from a source repo and distribute runs to various nodes using the --remote option with guild run. You can then use the pull command to collect the results across the various nodes.

The alternative to this architectural approach is to use a separate orchestration tier that uses Guild to run operations and consolidates results (runs) itself. There are a bunch of these (e.g. Airflow, Kubeflow, Dask and Prefect, to some extent) — I imagine your organization has looked at some of these. In this case, Guild just becomes another job that’s run by the scheduler.

This advice is super high level and there’s more to dig into. If you’d like to get into the details, I’d be curious to know a few things:

  • Do you instantiate your VMs dynamically based on a job requirement, or do you have these VMs running and available across jobs?
  • Are you using a workflow/orchestration/scheduler currently?
  • Are you using any sort of CI/CD (e.g. Jenkins) tool to pull from source control and run jobs?

As @copah mentioned, Guild has forthcoming support for DvC, which will let you pull resources from DvC as a way to synchronize on versioned data. You should be able to start using Guild before that, however, as Guild can grab resources from URLs (which DvC provides).

You still have the somewhat complex problem of VM instantiation and job orchestration. Guild provides a specific facility for that (described at a high level above) but there are lots of other options. I’d start with what your org is looking at and see how Guild can fit into that.