How to require multiple files from the same directory in guild.yml?

I want to call something like data/* under required files. Is this possible?

Basically, I have different datasets as parameters to my problem that I call via flags. Listing each dataset by name is not robust. Can I require all of them somehow? Thanks!

- model: modelname
  operations:
    train:
      sourcecode:
        - exclude: '*.ipynb'
        - exclude: '*.md'
        - exclude: '.git*'
      requires:
        - file: figs
          target-type: link
        - file: data/*
          target-type: link
      main: src.modelname
      output-scalars:
        - step: 'epoch: (\step)'
        - train_loss: 'train_loss: (\value)'
        - loss: 'valid_loss: (\value)'
      flags:
        dataset: '01.npy'
      ...

If it’s not clear, my current guild.yml has a requirement section that looks vaguely like this:

      requires:
        - file: figs
          target-type: link
        - file: data/01.npy
          target-type: link
        - file: data/02.npy
          target-type: link
        - file: data/03.npy
          target-type: link

where in my case each .npy file is a slightly different dataset generated by another script. I’m not sure what’s the best way to organize the workflow when I need to generate these new datasets. I could append each one here (i.e. add 04.npy and 05.npy and so on), but I was hoping there was a cleaner solution. Thanks!

You can do it with the select directive of the file attribute:

resources:
  default:
    - file: './data/'
      select:
        - '.*\.npy' # regex pattern
      target-type: link
      target-path: 'data'  # if this is absent, files are placed at the top level of the run directory

Thank you! This was very helpful.


For future reference, the cheatsheet has a section under Dependencies describing the resources section and the select directive.

I needed to add a reference under requires pointing to the resource defined under resources, so the code ended up like this.


- model: modelname
  operations:
    train:
      sourcecode:
        - exclude: '*.ipynb'
        - exclude: '*.md'
        - exclude: '.git*'
      requires: data
      main: src.modelname
      output-scalars:
        - step: 'epoch: (\step)'
        - train_loss: 'train_loss: (\value)'
        - loss: 'valid_loss: (\value)'
      flags:
        dataset: '01.npy'
  resources:
    data:
      - file: './path/to/data/'
        select:
          - '.*\.npy' # regex pattern
        target-type: link
        target-path: 'path/to/data/'

Glad that helped. Sorry I missed that part; you're right, it has to be added back under requires.

In general, Guild AI is pretty well designed, but the documentation is still being filled out. Apparently, resources and sourcecode eventually point to the same resolver, so their syntax is similar, except that sourcecode takes a root attribute whereas resources list files directly. You can always try things out to see if they work.
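
To illustrate the difference, here's a minimal sketch of the two forms side by side; the src root and the select patterns are hypothetical, and the mapping form of sourcecode with root and select assumes a reasonably recent Guild version:

train:
  sourcecode:
    root: src              # sourcecode takes a root attribute
    select:
      - exclude: '*.ipynb'
  requires:
    - file: data           # resource sources list files directly
      select:
        - '.*\.npy'
      target-type: link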

Thanks @teracamo once again for your very helpful input!

I’ll just provide my two cents here.

The easiest way to make project files available to a run is to either list them separately under requires or to specify a parent directory.

List files:

train:
  requires:
    - file: data/01.npy
    - file: data/02.npy
    # etc

Directory:

train:
  requires:
    - file: data

By default, Guild copies the required files. This is an important consideration. Copies are independent of the project and so the runs are a safer/more reliable record of what happened.

You can link to project files instead, which saves disk space and time, especially when the required files are very large. However, this creates a long-lived dependency of the run on the project.

If you can afford the disk space and copy times, it’s always better to copy project dependencies.
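
To make the choice explicit, you can set target-type per source. Here's a minimal sketch reusing the data and figs directories from the examples above:

train:
  requires:
    - file: data
      target-type: copy   # the default; the run gets its own copy
    - file: figs
      target-type: link   # faster and smaller, but ties the run to the project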

As @teracamo mentions, you can include a select attribute to limit copied/linked files from a directory. This applies to archives as well. The idea is that you don’t always need everything from a directory or archive and so can “select” the files that you need.
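
The same pattern works for an archive source; here's a sketch assuming a hypothetical datasets.zip, which Guild unpacks by default before applying select:

train:
  requires:
    - file: datasets.zip   # hypothetical archive; unpacked by default
      select:
        - '.*\.npy'        # only the npy files are resolved into the run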

All of that said, the case that you’re describing follows a common pattern of a prepare dataset operation followed by a train operation.

Consider this example:

prepare-dataset:
  main: src.dataset_script  # your *.npy prep script here

train:
  main: src.modelname
  requires:
    - operation: prepare-dataset

In this example, Guild is used to generate runs that contain an npy file, which is the dataset used by train. Because each run is separate, you can use the same name for the data set file — e.g. data.npy. This takes the place of the various versions 01.npy, 02.npy, etc.

Using Guild to generate these files gives you the benefit of tracking the source code and flags used to create the datasets.

If your dataset prep script requires project files (e.g. raw datasets, etc.) you can include those files as dependencies. If those dependencies exist on a network server somewhere, you can specify a dependency using a URL (e.g. https://your.server.com/raw-data-v1.0.csv).
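
Such a URL dependency might be declared like this (using the placeholder URL above; the sha256 value is a hypothetical checksum that Guild can use to verify the download):

prepare-dataset:
  main: src.dataset_script
  requires:
    - url: https://your.server.com/raw-data-v1.0.csv
      sha256: 0123abc...   # hypothetical; omit to skip verification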

When you run train, Guild looks for a run for prepare-dataset and links the generated files from that run in the train run directory. So your train script can read data.npy from the current directory.

By default Guild automatically uses the latest non-error run for prepare-dataset. You can alternatively specify a run ID for the dataset run that you want to use. This is comparable to specifying the dataset flag from your example. If you write your prepared dataset as data.npy (or some other consistent name) your script can read that name each time. Guild does the work of resolving the specific dataset file for you when it resolves the operation dependency.
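
On the command line, pinning a specific dataset run looks like this (the run ID below is hypothetical; use guild runs to list IDs):

guild run train prepare-dataset=5c69d921 -y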

Here’s an example.

First run prepare-dataset:

guild run prepare-dataset <some flags> -y  # generates `data.npy`

Then run train:

guild run train -y

Guild resolves the dataset requirement by linking the files generated by prepare-dataset, and train can read those from its run directory.

If you want to use this pattern for files that you’ve already created (i.e. the files under data) you can use a prepare-dataset operation that copies a specific data file from data and makes it available either as is or using a normalized name (e.g. data.npy).

Here’s an example — I’ll use the operation name dataset since we’re not really preparing anything, just copying.

dataset:
  main: guild.pass
  flags:
    name:
      required: yes
      description: Name of the dataset
  requires:
    - file: data/${name}.npy
      rename: '*.npy data.npy'
      target-type: copy  # see note below

In this pattern, I’m copying the file, not linking. This creates one copy of the project file per run. That file in turn can be used by multiple train operations that link to it. This is how you can make your runs independent of your project.
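
A train operation consuming these runs might look like this sketch (operation dependencies are linked by default, which is safe here because the dataset run holds its own copy of the file):

train:
  main: src.modelname
  requires:
    - operation: dataset

Then, for example:

guild run dataset name=01 -y
guild run train -y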

Here’s a sample project that covers all of these examples in concrete ways:

I appreciate that this all sounds pretty complicated. There are trade-offs for each of these approaches, which can lead to some headaches when thinking things through.

Still, here are the principles that I like to maintain:

  • Runs should be independent of projects. Avoid linking to project files if you can.

  • While it’s common to generate and store dataset files in a project, consider using runs. Datasets are not that different from trained models in the sense that you want to carefully track how they’re created. Runs are better for this than stored project files.