Thanks @teracamo once again for your very helpful input!
I’ll just provide my two cents here.
The easiest way to make project files available to a run is to either list them separately under `requires` or to specify a parent directory.
List files:

```yaml
train:
  requires:
    - file: data/01.npy
    - file: data/02.npy
    # etc
```
Directory:

```yaml
train:
  requires:
    - file: data
```
By default, Guild copies the required files. This is an important consideration. Copies are independent of the project and so the runs are a safer/more reliable record of what happened.
You can link to project files, which saves disk space and time, especially when the required files are very large. However, this creates a long running dependency of the run on the project.
If you can afford the disk space and copy times, it’s always better to copy project dependencies.
As @teracamo mentions, you can include a `select` attribute to limit the copied/linked files from a directory. This applies to archives as well. The idea is that you don't always need everything from a directory or archive, so you can "select" just the files you need.
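For example, to pull only the `.npy` files out of the `data` directory, something like this should work (a sketch; if I remember Guild's syntax right, `select` patterns are regular expressions, so double-check the pattern against your layout):

```yaml
train:
  requires:
    - file: data
      select: .+\.npy
```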
All of that said, the case that you’re describing follows a common pattern of a prepare dataset operation followed by a train operation.
Consider this example:
```yaml
prepare-dataset:
  main: src.dataset_script  # your *.npy prep script here
train:
  main: src.modelname
  requires:
    - operation: prepare-dataset
```
In this example, Guild is used to generate runs that contain an `npy` file, which is the dataset used by `train`. Because each run is separate, you can use the same name for the dataset file (e.g. `data.npy`). This takes the place of the various versions `01.npy`, `02.npy`, etc.
Using Guild to generate these files gives you the benefit of tracking the source code and flags used to create the datasets.
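If it helps, here's a minimal, hypothetical sketch of what `src/dataset_script.py` could look like. The flag-as-global pattern, the `scale` flag, and the toy data are all my inventions for illustration, and I'm using `pickle` instead of `numpy.save` just to keep the sketch dependency-free:

```python
# Hypothetical prep script (src/dataset_script.py).
# Guild can expose module globals as flags, so `scale` becomes a
# flag you could set with e.g. `guild run prepare-dataset scale=2.0`.
import pickle

scale = 1.0  # flag: Guild overrides this global at run time

def prepare(scale):
    # Stand-in for real preprocessing: scale a toy dataset.
    return [x * scale for x in range(10)]

data = prepare(scale)

# Guild runs the script inside the run directory, so writing to a
# relative path puts the file in the run itself.
with open("data.npy", "wb") as f:
    pickle.dump(data, f)
```

In a real script you'd build an `ndarray` and call `numpy.save("data.npy", arr)` instead.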
If your dataset prep script requires project files (e.g. raw datasets, etc.) you can include those files as dependencies. If those dependencies exist on a network server somewhere, you can specify a dependency using a URL (e.g. `https://your.server.com/raw-data-v1.0.csv`).
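A sketch of what that could look like, reusing the placeholder URL from above (Guild file resources support `url` sources; the `sha256` attribute is optional but lets Guild verify the download):

```yaml
prepare-dataset:
  main: src.dataset_script
  requires:
    - url: https://your.server.com/raw-data-v1.0.csv
      sha256: <checksum of the file>  # optional but recommended
```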
When you run `train`, Guild looks for a `prepare-dataset` run and links the generated files from that run into the train run directory. So your train script can read `data.npy` from the current directory.
By default Guild automatically uses the latest non-error run for `prepare-dataset`. You can alternatively specify a run ID for the dataset run that you want to use. This is comparable to specifying the `dataset` flag from your example. If you write your prepared dataset as `data.npy` (or some other consistent name) you can use that name each time you read it. Guild does the work of resolving the specific dataset file for you when it resolves the operation dependency.
Here's an example. First run `prepare-dataset`:

```shell
guild run prepare-dataset <some flags> -y  # generates data.npy
```

Then run `train`:

```shell
guild run train -y
```
Guild resolves the dataset requirement by linking the files generated by `prepare-dataset`, and `train` can read those from its run directory.
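On the script side there's nothing Guild-specific; the train script just reads the file from its current directory. Another hypothetical sketch, with `pickle` standing in for `numpy.load` and a trivial "model" in place of real training:

```python
# Hypothetical train script (src/modelname.py). Guild links the
# resolved dependency into the run directory and runs the script
# there, so a plain relative path finds the dataset.
import pickle

def load_dataset(path="data.npy"):
    # Stand-in for numpy.load.
    with open(path, "rb") as f:
        return pickle.load(f)

def train(data):
    # Trivial stand-in "model": just compute the mean.
    return {"mean": sum(data) / len(data)}

# For illustration only: fake the file that the resolved
# dependency would normally provide.
with open("data.npy", "wb") as f:
    pickle.dump([1.0, 2.0, 3.0], f)

model = train(load_dataset())
print(model)  # {'mean': 2.0}
```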
If you want to use this pattern for files that you've already created (i.e. the files under `data`), you can use a `prepare-dataset` operation that copies a specific data file from `data` and makes it available either as-is or under a normalized name (e.g. `data.npy`).
Here's an example. I'll use the operation name `dataset` since we're not really preparing anything, just copying.
```yaml
dataset:
  main: guild.pass
  flags:
    name:
      required: yes
      description: Name of the dataset
  requires:
    - file: data/${name}.npy
      rename: '*.npy data.npy'
      target-type: copy  # see note below
```
In this pattern, I'm copying the file, not linking. This creates one copy of the project file per run. That file in turn can be used by multiple `train` operations that link to it. This is how you can make your runs independent of your project.
Here’s a sample project that covers all of these examples in concrete ways:
I appreciate that this all sounds pretty complicated. There are trade-offs for each of the approaches, which can lead to some headaches when thinking things through.
Here are the principles, though, that I like to maintain:

- Runs should be independent from projects. Avoid linking to project files if you can.
- While it's common to generate and store dataset files in a project, consider using runs instead. Datasets are not that different from trained models in the sense that you want to carefully track how they're created. Runs are better for this than stored project files.