If I run guild run from outside the classification folder, having configured guild.yml according to the documentation (I tried both main: classification.train and main: classification/train), Guild tells me it cannot find the classification.train module.
If I run guild run train.py from inside the classification folder, Guild tells me it cannot do relative imports with no known parent package (expected, I guess).
So how would I run the training script with the above project structure, argparse, and relative imports?
Hey, could you also provide the guild.yml file and the error messages you get?
Maybe somebody more knowledgeable will have a better answer; here is my suggestion though:
I’d start simple and build up from there => start with just the following two files and nothing more:
guild.yml:
train:
  main: classification.train
classification/train.py:
print("hello world")
Now guild ops should show you the train operation and guild run train should succeed.
And if this works, then you can work towards finding the issue you have.
Thanks @MatejNikl, both the guild.yml and the error message are as I described above:
> cat guild.yml
train:
  main: classification.train
> guild run train
You are about to run train
Continue? (Y/n)
WARNING: Skipping potential source code file /Users/fspaolo/dev/sentinel-1-ee/classification/detections_test.csv because it's too big. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
WARNING: Found more than 100 source code files but will only copy 100 as a safety measure. To control which files are copied, define 'sourcecode' for the operation in a Guild file.
guild: No module named classification.train
Maybe due to the 100 source code file limit my modules are not being copied (which wouldn’t make sense, as the module is specified in guild.yml… it should be the priority code to copy)?
I did figure that out actually (which seems counterintuitive, as I am telling Guild the path to my Python code/module… so all Python scripts in there should be “sourcecode” by default, no?).
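For reference, the kind of sourcecode spec the warning points to looks roughly like this (the patterns are illustrative, not my exact config):

train:
  main: classification.train
  sourcecode:
    - include: '*.py'
    - exclude: '*.csv'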
Then I ran into another problem. My code calls a git command at execution time, but Guild runs the code from another (temporary) folder, which is not a git repository… so I get an error: “the current directory is not a git repo”. I then changed my code to disable the git command… and got another error: PathNotFoundError: nothing found at path ''. Is this because the path to the data is relative and the code is being run from a different directory? If so, many things will obviously break. Do I have to figure out what breaks myself and come up with workarounds?
I am not doing anything out of the ordinary here. My project structure is standard, using modules, relative imports, argparse and git. So how do I run this?
For that reason, each time you do guild run, Guild copies your code under a run directory made specifically for that run before running the code.
Guild does this, for example, so that you have a snapshot of the ad hoc code that produced some nice results, and you can actually go back to that code and see what you did there.
This run directory becomes the working directory your script resolves all relative paths against. And since you haven’t told Guild to also copy the training data there, and the path you specified is relative, the training data simply aren’t there.
You can specify the training data as an absolute path just to see if that works, but I would suggest including your training data in the Guild ecosystem somehow as well: at least as a project file dependency, or probably rather as a run file dependency, since your training data are perhaps the result of some prepare-train-data operation, aren’t they?
Maybe you could go through the dependencies example to get a better understanding of how they work.
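For instance, if a prepare-train-data operation produced the data, a sketch might look like this (the operation and module names are just examples):

prepare-train-data:
  main: classification.prepare_data

train:
  main: classification.train
  requires:
    - operation: prepare-train-data

Guild would then resolve the required files from the most recent prepare-train-data run into the train run directory.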
Before training, it forces the user to commit the current state of the repository.
At run time, the code logs the commit state of the repo for that specific run (roughly as sketched below).
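In rough terms it does something like this (a simplified sketch, not our actual code; names are illustrative):

import subprocess

def require_clean_repo():
    # Abort if there are uncommitted changes before a training run starts.
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if out.strip():
        raise RuntimeError("commit your changes before training")

def current_commit():
    # Record the commit hash so a run can be traced back to the exact code.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

Both calls assume the working directory is inside the repo, which is why they fail when Guild runs the copied code from a run directory outside the repository.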
Changing to absolute paths seems to work. Now, this is not the best idea, as the whole package/project is self-contained with relative paths. We run this project from different virtual machines under different user environments, so paths like /User/username/project will break.
Copying the data to the Guild run directory is prohibitive (and bad practice). Our data is huge, data should not be moved around, and we load data from both the local machine and cloud storage depending on the situation. The cloud data takes days just to generate.
The whole Guild advertisement is that you don’t need to modify your code/strategy, isn’t it? We implement all Python/ML best practices in our projects, and we do large-scale ML experiments (with several petabytes of satellite data). So the question is: can Guild handle this scenario with only the addition of a configuration file? If so, is there a full example somewhere?
Guild resources can be symbolic links, so relative paths still work. Look up the resources docs.
Having a dirty git repo is not an issue under Guild, since Guild copies the entire source code for each run. There is a command, guild diff RUN_ID --working, that shows you the difference between the run and the working-directory source code.
@garrett wrote a great comment on why having to commit your code for each experiment is a huge pain. I’ll see if I can find it later. Guild doesn’t enforce this, which makes experimenting faster and easier.
I’ll see if I can write a full example later when I have more time.
@fspaolo you’re running into some common issues that are frustrating but there are some good reasons for them. That’s an annoying answer I know but bear with me…
ML projects almost always run operations within a source directory. Results are either logged to some project-local directory that’s included in .gitignore or logged to a temporary location. E.g. a lot of the examples you run into from Google log under /tmp.
Guild goes out of its way to prevent this. Guild runs from an empty directory, which is the run directory. Each run gets its own directory. Guild copies source code to that directory along with any other files or directories that you need. Guild does its best to guess the source code. It does not even attempt to guess other files/directories.
You’ve already seen how to control the source code spec. Guild could certainly spend a bit more time figuring out exactly what your source code is, but it could still miss important source-related files. So it does a lightweight test and includes safeguards to prevent copying very large or very many files. If this lightweight test doesn’t work, you need to specify the config, as you’ve done.
Regarding data, you want this:
train:
  requires:
    - file: data
      target-type: link
Guild links to directories by default, but this will change in the future, so it’s best to configure it explicitly this way.
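If it helps to picture it, a run directory then ends up looking something like this (illustrative layout; the link target is whatever your project data path resolves to):

classification/                        (copied source code)
data -> /path/to/your-project/data     (symlink created for the file dependency)

so a relative path like data/detection_tiles_v1.zarr resolves the same way it does in the project directory.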
As for the Git repo status, as was mentioned, Guild does record the git commit for you. If there are changed files, Guild indicates the latest commit with an asterisk. Importantly, as was also mentioned, Guild copies the source code and runs the operations using that code and not the code in your project. This is an important feature as it lets you freely modify your project source code when operations are still running without impacting those operations. It also lets you stage several operations, each with their own source code changes.
If this approach doesn’t work, please let us know and we can consider other approaches.
Paths cannot be relative within a Python package then? (I’ve tried linking to both the full and relative data paths)
Another problem is that the code generates a few directories to save some stuff (e.g. model weights, log files). Here again I have a problem with the paths:
Finally, is there a way to capture a specific variable or a print statement within the code to be displayed in the terminal, say with guild compare? I would like to quickly search for a specific run, hit Enter, and have this information displayed right in the terminal.
A use case would be to get the full path to the weights, model, or config file so I can copy and paste it as an argument to my inference code (without having to search for and open any files).
> guild run train
You are about to run classification:train
  cycles: 3
  model-name: model
  no-require-clean: yes
  train-data: data/detection_tiles_v1.zarr
  train-index: data/detection_tiles_v1_index.zarr/train/0
Continue? (Y/n)
but it raises:
zarr.errors.PathNotFoundError: nothing found at path ''
Sorry for the late reply (got swamped with work)! Yes, it works so far. We are currently evaluating Guild with our ML pipelines (in order to find the “right” tool for our organization’s needs). One of the challenges we have is managing/tracking our complex data flow. We deal with large amounts of global satellite data from multiple sources, and need to be able to run (and track) the same ML code on both local machines and virtual machines in the cloud. Is there any specific doc or discussion I could read on managing/tracking multiple runs (with the same code and data source) on different machines?
Hi @fspaolo, Guild’s current support for this is similar in concept to the way git is designed: you push and pull data/content (in Guild’s case, runs) to and from machines based on your workflow. Like git, Guild doesn’t have a central point of control. Instead, you orchestrate your workflows from various Guild client installations.
Unfortunately, workflows don’t come “out of the box” with Guild. You need to do some wiring based on your organizational requirements. A common workflow is to pull from a source repo and distribute runs to various nodes using the --remote option with guild run. You can then use the pull command to collect the results across the various nodes.
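In CLI terms the loop looks roughly like this (assuming a remote named node-1 defined in your Guild user config):

guild run train --remote node-1     # start the operation on the remote node
guild runs --remote node-1          # check run status on that node
guild pull node-1                   # copy the remote runs back to the local environment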
The alternative to this architectural approach is to use a separate orchestration tier that uses Guild to run operations and consolidates results (runs) itself. There are a bunch of these (e.g. Airflow, Kubeflow, Dask and Prefect, to some extent) — I imagine your organization has looked at some of these. In this case, Guild just becomes another job that’s run by the scheduler.
This advice is super high level and there’s more to dig into. If you’d like to get into the details, I’d be curious to know a few things:
Do you instantiate your VMs dynamically based on a job requirement, or do you have these VMs running and available across jobs?
Are you using a workflow/orchestration/scheduler currently?
Are you using any sort of CI/CD (e.g. Jenkins) tool to pull from source control and run jobs?
As @copah mentioned, Guild has forthcoming support for DvC, which will let you pull resources from DvC as a way to synchronize on versioned data. You should be able to start using Guild before that, however, as Guild can grab resources from URLs (which DvC provides).
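For example, a URL source in a Guild file dependency looks roughly like this (the URL is just a placeholder):

train:
  requires:
    - url: https://example.com/datasets/tiles_v1.zip

Guild downloads the file (caching it locally) and makes it available in the run directory, so the operation itself doesn’t care whether the URL is served by DvC or anything else.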
You still have the somewhat complex problem of VM instantiation and job orchestration. Guild provides a specific facility for that (described at a high level above) but there are lots of other options. I’d start with what your org is looking at and see how Guild can fit into that.