Using data generated from a previous step in a pipeline

In a pipeline, each step has its own run directory and run key. As far as I can observe, the structure looks like this:

├──pipeline_rundir
   ├── symlink_to_step1_rundir
   │   ├── requested_resources_1
   │   └── run_generated_resources
   └── symlink_to_step2_rundir
       └── requested_resources_2

Is it possible to forward the resources generated in run_generated_resources in step 1 to the directory for step 2 without setting up a shared folder (so that each pipeline run can pass the trained network state to the validation step)?

If I understand you correctly, this should happen via dependencies. For example, I assume requested_resources_2 is set up by Guild through a dependency. You can do the same for run_generated_resources using an operation dependency.

The pipeline is not really a part of that relationship. You can resolve all required resources simply by running the downstream run - in this example step2.

Let’s set aside the pipeline operation for the time being.

step1:
  requires:
    - file: requested_resources_1
step2:
  requires:
    - file: requested_resources_2
    - operation: step1
      select: run_generated_resources

The select is optional. You use that to link to only matching files generated by step1.

When you run step2, Guild looks for a run for step1 and creates links to the selected files from that run.


Thanks, this does exactly what I need.

But I think it behaves slightly differently from the first-level select, as it doesn’t glob with wildcards, e.g. select: '*.txt'. I’m not sure if this is the intended behavior, so I have to generate the files into a new folder relative to the run and select that folder instead.

What is really the advantage of dependencies? If I have two steps (say prepare and train), save a file from the first step in some directory, then load that same file in train, and combine everything in the pipeline, why would I need a dependency when everything is already chained together by saving and loading files across these two steps?

In my case, each run of the first step generates different outputs, which I then evaluate in the second step. So the generated files are “run dependent”, and I cannot track them by saving them to the same folder: I run in parallel, so there’s a chance the output files will be overwritten before I can analyze them.

I can also save them for further use somewhere else with good data organization.

So you create a specific folder for every run and save the data from the first step into it? How do you name the folders for each run?

I just dump everything into a folder named ‘Output’ and don’t have a specific naming system for each run. That is why Guild is so useful to me: it creates a sort of separate container for each run, so I can store the data systematically and with ease, without having to worry about accidentally overwriting previous outputs.

I understand Guild saves flags and scalars in the Guild home folder. But it doesn’t save data. If data is saved to an output folder and you run two runs at the same time (in parallel, as you said), then it is possible that the output of one run will be overwritten by the other.

Basically, in each Guild run, it automatically creates a container for you, by default under your Python env directory, in a directory named with a 256-bit key that is presumably generated randomly. It then copies all the source code to that container’s directory and runs the code there, so any relative paths are relative to that container.

For example, the first run is initiated in .guild/[1st_run_256-bit_label]. If you generate your data with a relative path like open('./output.txt', 'w'), it will be written to .guild/[1st_run_256-bit_label]/output.txt. If you run a second run in parallel with the exact same code, its output will be written to .guild/[2nd_run_256-bit_label]/output.txt. So there won’t be any overwriting. You are right that Guild won’t manage your outputs though; you still need to process them yourself.

I am no expert really, but I learned how Guild builds this folder structure by actually browsing the Guild home dir. On Linux, you can find the directory of each run with guild open [num] --cmd "echo".

The value for select is applied as a regular expression. So you’d use this:

downstream:
  requires:
    - operation: upstream
      select: .*\.txt

Regular expressions are a bit painful for simple cases like this (we have an issue that proposes supporting glob wildcards), but they come in handy for anything even slightly complicated.
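To make the glob versus regular expression difference concrete, here is a small sketch using only Python's standard re and fnmatch modules. It does not call Guild at all. The file names are made up for illustration, and whether Guild anchors the select pattern at the start, at both ends, or not at all is an assumption you should verify, which is why both anchoring styles are shown.

```python
import fnmatch
import re

# Hypothetical file names that an upstream run might generate.
files = ["notes.txt", "model.pt", "logs/train.txt", "readme.txt.bak"]

# A glob such as '*.txt' is not a valid regular expression: the leading '*'
# has nothing to repeat, so compiling it raises re.error.
try:
    re.compile("*.txt")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

# The regular expression from the post above.
pattern = re.compile(r".*\.txt")

# How the pattern is anchored changes what it selects. Which behavior Guild
# applies is an assumption to check, so both variants are computed here.
prefix_matches = [f for f in files if pattern.match(f)]
full_matches = [f for f in files if pattern.fullmatch(f)]

# fnmatch.translate converts a glob into an equivalent regular expression,
# which is handy when you think in wildcards but need a regex.
glob_as_regex = fnmatch.translate("*.txt")
glob_matches = [f for f in files if re.match(glob_as_regex, f)]

print(glob_is_valid_regex, prefix_matches, full_matches, glob_matches)
```

If the unanchored behavior turns out to select too much, ending the pattern with $ (for example .*\.txt$) is the usual way to require that the name actually ends in .txt.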

@teracamo, how do you then load that data in the second step? The second step created another directory.

The pipeline really doesn’t do any sort of roll-up — it just runs the steps and creates links to each step run directory. The pipeline e.g. doesn’t contain the step runs. Guild doesn’t nest runs — everything is in a flat list under $GUILD_HOME/runs. If runs are related, as in the case of a pipeline and its steps, the relationship is expressed using links.

If an operation needs a file, that file must be resolved using dependencies. There’s no clean way around this. This makes dependencies quite important to Guild.

Guild doesn’t save data, that’s technically true — your script saves data. But the intent is 100% that a run contains saved data. That’s really the essence of an experiment. It’s not just the flags and scalars but also any generated file.

When Guild generates a run, it creates a unique directory using a UUID. There’s no chance of overwriting files provided your script saves to a relative path. Guild makes no attempt to jail an operation process. You’re free to save files wherever you want, including outside the run directory. But Guild runs the operation process using the run directory as the current working directory (e.g. os.getcwd()). So if you save a file like this:

open("hello.txt", "w").write("hello!")

you’ll see that file show up in the run directory. You can use guild ls to list these files.

So in this sense, Guild very much “saves data”. Because each run uses a unique directory, you’re perfectly safe in writing whatever you want, again, provided you use a relative path.
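The point about relative paths can be illustrated without Guild. The sketch below only imitates the idea: it creates a fresh, uniquely named directory per "run", makes it the current working directory, and writes hello.txt with a relative path. The helper run_in_fresh_dir and the temporary runs_home directory are hypothetical names for this illustration, not part of Guild.

```python
import os
import tempfile
import uuid


def run_in_fresh_dir(runs_home, message):
    """Imitate a run: unique directory, used as cwd while the 'script' runs."""
    run_dir = os.path.join(runs_home, uuid.uuid4().hex)
    os.makedirs(run_dir)
    previous = os.getcwd()
    os.chdir(run_dir)
    try:
        # Relative path, so the file lands inside this run's directory.
        with open("hello.txt", "w") as f:
            f.write(message)
    finally:
        os.chdir(previous)
    return run_dir


# Stand-in for the runs location; a temporary directory is used here.
runs_home = tempfile.mkdtemp()
first = run_in_fresh_dir(runs_home, "hello from run 1")
second = run_in_fresh_dir(runs_home, "hello from run 2")

# Both files exist side by side because each run wrote into its own directory.
print(sorted(os.listdir(runs_home)))
```

An absolute path in place of "hello.txt" would make both calls write to the same location, which is exactly the overwriting scenario discussed earlier in the thread.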

You’re duplicating what Guild does :slight_smile:

If you haven’t already, I suggest that you execute the steps in Get Started. Resist the temptation to just read or skim through the material. Executing the steps is a much better way to learn the ins and outs of Guild.

I was actually trying to explain my understanding of how guild works. Glad it sounds like my understanding was pretty accurate :rofl:

Ah, I misunderstood this. I thought “it” referred to your script haha. Yes, Guild does this — that’s exactly right :slight_smile:

Great explanation, thanks. I just don’t understand how the second step in the pipeline imports data generated in the first step when they are in separate directories. What is the path?

Guild creates links to the resolved files.

This is an important concept. I recommend running this simple example:

# Save this file as guild.yml to an empty directory.
upstream:
  exec: python -c "open('msg.txt','w').write('hello')"

downstream:
  exec: python -c "print(open('msg.txt').read())"
  requires:
    - operation: upstream

To see how this works, first run upstream:

guild run upstream -y

Next run downstream:

guild run downstream -y

How does it work? Take a look at the directory structure for downstream.

guild open -o downstream

This will open a file explorer and show that msg.txt is a link to the file generated by the latest upstream run.
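If it helps to see the mechanism in isolation, here is a rough sketch of the same idea in plain Python, with no Guild involved. It creates two directories standing in for the upstream and downstream runs, writes msg.txt in the first, and places a symbolic link to it in the second. The directory names are hypothetical, and how Guild lays out and names its links internally is not something this sketch claims to reproduce.

```python
import os
import tempfile

# Stand-ins for two run directories (names are made up for illustration).
runs_home = tempfile.mkdtemp()
upstream_dir = os.path.join(runs_home, "upstream_run")
downstream_dir = os.path.join(runs_home, "downstream_run")
os.makedirs(upstream_dir)
os.makedirs(downstream_dir)

# The upstream "run" generates a file in its own directory.
with open(os.path.join(upstream_dir, "msg.txt"), "w") as f:
    f.write("hello")

# Resolving the dependency amounts to linking that file into the downstream
# directory. Note: os.symlink may need extra privileges on Windows.
link_path = os.path.join(downstream_dir, "msg.txt")
os.symlink(os.path.join(upstream_dir, "msg.txt"), link_path)

# The downstream "script" runs with its own directory as cwd and simply
# opens the file by its relative name.
previous = os.getcwd()
os.chdir(downstream_dir)
try:
    with open("msg.txt") as f:
        message = f.read()
finally:
    os.chdir(previous)

print(message)
```

So the answer to "what is the path?" is simply the relative file name: the downstream script does not need to know where the upstream run lives.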

Note that pipelines are not involved.

Here’s a more complete example.


@garret, I understand and use requirements now (even better, I use resources through config).
Your feedback was very useful as always!