Data Filepath Flag

Hello, I’m converting a project to Guild and I have a question about setting a file path as a flag.

At the top of my training script I have a variable data_fp = "…/…/data/dtype2/processed_data.npy".
Before this was a Guild project, I was just cutting and pasting different file paths whenever I wanted to train the model on new data, but now I want to be able to set the path as a flag.

When I try to run this with Guild I get an error because the script doesn’t see that path. Is there a way to set this up so that the script can see the data from the run directory?

Also, this is my current folder structure, including where my guild file is:

proj/
  data/
    dtype1/
      raw_data.npy
    dtype2/
      processed_data.npy
  scripts/
    guild.yml
    data_processing/
      process_data.py
    model/
      train.py
      ...

The easiest way is to make the data directory available to the run using a dependency.

# guild.yml

test:
  requires:
    - file: data

Then define a flag that uses the relative path to the file. E.g.

# test.py

data = "data/bar.txt"
print(open(data).read())
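For the training script in the question, the same idea might look like the sketch below. It assumes the dependency above has made data/ available in the run directory and that Guild picks up the module-level data_fp as a flag; resolve_data_path is a hypothetical helper for illustration, not part of Guild.

```python
import os

# Module-level global. Assumption: Guild imports this as a flag, so it
# can be overridden per run instead of being edited by hand.
data_fp = "data/dtype2/processed_data.npy"


def resolve_data_path(path):
    """Resolve `path` against the current directory and fail early if missing.

    Hypothetical helper: the run directory is the script's working
    directory, so a relative path is looked up there.
    """
    full = os.path.abspath(path)
    if not os.path.exists(full):
        raise FileNotFoundError(
            f"{path!r} not found under {os.getcwd()!r}; "
            "check the 'requires' entry in guild.yml"
        )
    return full
```

With something like this in place, a different file could be selected at run time with a flag assignment such as guild run test data_fp=data/dtype1/raw_data.npy, assuming the operation is named test as in the guild.yml above.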

A working example of this is here:

There are a couple of other approaches that occur to me, but let’s start with this one as it’s the most straightforward. If you run into issues or have questions, just ask here and we can work through them.

I just changed the guild file, but it isn’t working yet. This is what it looks like now:

train:
  description: Train a flow model based on a data file and optionally save the parameters
  main: scripts/flow_model/optim_flow_model
  flags-dest: globals
  flags-import: all
  requires:
    - file: data/saved_npy

That will create saved_npy in the run directory. You can confirm this by running:

guild ls

That’s a good way to see what Guild creates in the run directory. Your script runs in that location, so if it can’t find something, it’s either because it’s not there or your script expects it in another location.
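The same check can be done from inside the script, which helps when the path the script expects and the path Guild created differ. This is only a sketch: list_run_dir is a hypothetical helper, and it follows symlinked directories on the assumption that the resolved dependency is a symlink.

```python
import os


def list_run_dir(root=".", max_depth=2):
    """Return relative paths under `root`, at most `max_depth` levels deep.

    Symlinked directories are followed; the depth limit keeps the walk
    bounded even if a link points at a large tree.
    """
    root = os.path.abspath(root)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        for name in dirnames + filenames:
            entries.append(os.path.normpath(os.path.join(rel, name)))
        if depth + 1 >= max_depth:
            dirnames[:] = []  # stop descending below max_depth
    return sorted(entries)


# Example use at the top of a training script:
# print(os.getcwd())
# print("\n".join(list_run_dir()))
```

If the path the script opens does not appear in that listing, the dependency is resolving to a different location than the script expects.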

You have a few options:

  1. Just link to the data directory (omit saved_npy). This will make the entire data tree available. Currently Guild symlinks to the directory so you’re not copying any files there.

train:
  requires:
    - file: data

  2. Specify a target-path of data so that the saved_npy directory is accessible as data/saved_npy.

train:
  requires:
    - file: data/saved_npy
      target-path: data

  3. Modify your script to look in saved_npy.

If you’re running into another issue, what is the error message you’re getting?

I went with option two and it worked, thanks for your help!
