Using data generated from previous step in Pipeline

teracamo · September 17, 2020, 1:37pm

In a pipeline, each step has its own run directory and run key. As far as I can observe, the structure looks like this:

├──pipeline_rundir
   ├── symlink_to_step1_rundir
   │   ├── requested_resources_1
   │   └── run_generated_resources
   └── symlink_to_step2_rundir
       └── requested_resources_2

Is it possible to forward the resources generated in run_generated_resources in step 1 to the directory for step 2 without setting up a shared folder, so that each pipeline run can pass the trained network state to the validation step?

garrett · September 17, 2020, 2:15pm

If I understand you correctly, this should happen via dependencies. E.g. requested_resources_2 is, I assume, set up by Guild through a dependency. You can do the same for run_generated_resources using an operation dependency.

The pipeline is not really a part of that relationship. You can resolve all required resources simply by running the downstream run - in this example step2.

Let’s set aside the pipeline operation for the time being.

step1:
  requires:
    - file: requested_resources_1
step2:
  requires:
    - file: requested_resources_2
    - operation: step1
      select: run_generated_resources

The select is optional. You use that to link to only matching files generated by step1.

When you run step2, Guild looks for a run for step1 and creates links to the selected files from that run.

teracamo · September 18, 2020, 6:31am

Thanks, this does exactly what I need.

But I think it behaves slightly differently from the first-level select, as it doesn’t glob with wildcards, e.g. select: '*.txt'. I'm not sure whether that is intended behavior. For now I generate the files into a new folder relative to the run and select that folder instead.

mislav · September 18, 2020, 12:27pm

What is really the advantage of dependencies? If I have two steps (say prepare and train), save a file from the first step in some directory, then load that same file in train, and combine everything in the pipeline, why would I need a dependency when everything is already chained together by saving and loading files across these two steps?

teracamo · September 18, 2020, 12:42pm

In my case, each run of the first step generates different outputs, which I then evaluate in the second step. So the generated files are “run dependent”, and I cannot track them by saving them to the same folder: I run in parallel, so in that setting there’s a chance the output files will be overwritten before I can analyze them.

I can also save them for further use somewhere else with good data organization.

mislav · September 18, 2020, 1:08pm

So you create a specific folder for every run to save the data from the first step? How do you name the folders for each run?

teracamo · September 18, 2020, 1:30pm

I just dump everything into a folder named ‘Output’ and don’t have a specific naming system for each run. That is why Guild is so useful to me: it creates a sort of separate container for each run, so I can store the data systematically and with ease, without having to worry about accidentally overwriting previous outputs.

mislav · September 18, 2020, 1:36pm

I understand Guild saves flags and scalars in the Guild home folder. But it doesn’t save data. If data is saved to an output folder and you run two runs at the same time (in parallel, as you said), then it is possible that the output of one run will be overwritten by the other.

teracamo · September 18, 2020, 1:59pm

Basically, for each run, Guild automatically creates a container under (by default) your Python env directory, in a directory named with a 256-bit key that is presumably generated randomly. It then copies all the source code to that container’s directory and runs the code there, so any relative paths will be relative to that container.

For example, the first run is initiated in .guild/[1st_run_256-bit_label]. If you generate your data with a relative path like open('./output.txt', 'w'), it will be generated at .guild/[1st_run_256-bit_label]/output.txt. If you run a second run in parallel, running the exact same code, its file will be generated at .guild/[2nd_run_256-bit_label]/output.txt. So there won’t be any overwriting. You are right that Guild won’t manage your outputs, though; you still need to process them yourself.

I am no expert really, but I learned how Guild builds this folder structure by actually browsing the Guild home dir. On Linux, you can find the directory of each run with guild open [num] --cmd "echo".

garrett · September 18, 2020, 2:38pm

The value for select is applied as a regular expression. So you’d use this:

downstream:
  requires:
    - operation: upstream
      select: .*\.txt

Regular expressions are a bit painful for simple cases like this (we have an issue that proposes supporting glob wildcards), but they come in handy for anything even slightly complicated.
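To make the difference between a glob and a regular expression concrete, here is a minimal Python sketch using only the standard library. It does not call Guild. How Guild applies the pattern internally (for example, search versus full match) is not stated in this thread, so the use of `re.fullmatch` and the helper name `is_valid_regex` are assumptions made for illustration.

```python
import fnmatch
import re


def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a Python regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A glob such as '*.txt' is not a valid regex: the leading '*' has nothing to repeat.
glob_is_regex = is_valid_regex("*.txt")

# The regex from the example above compiles and matches names ending in '.txt'.
txt_pattern = re.compile(r".*\.txt")
matches_txt = txt_pattern.fullmatch("output.txt") is not None
matches_csv = txt_pattern.fullmatch("output.csv") is not None

# fnmatch.translate converts a glob into a regex string, which can help when
# you think in globs but need to write a regular expression.
translated = fnmatch.translate("*.txt")
translated_matches = re.match(translated, "output.txt") is not None

print(glob_is_regex, matches_txt, matches_csv, translated_matches)
```

The string returned by `fnmatch.translate` includes its own flags and anchors, so treat it as a starting point for writing the regex by hand rather than something to paste into `select` unchanged.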

mislav · September 18, 2020, 2:41pm

@teracamo, how do you then load that data in the second step? The second step created another directory.

garrett · September 18, 2020, 2:46pm

The pipeline really doesn’t do any sort of roll-up; it just runs the steps and creates links to each step run directory. For example, the pipeline doesn’t contain the step runs. Guild doesn’t nest runs: everything is in a flat list under $GUILD_HOME/runs. If runs are related, as in the case of a pipeline and its steps, the relationship is expressed using links.

If an operation needs a file, that file must be resolved using dependencies. There’s no clean way around this. This makes dependencies quite important to Guild.

garrett · September 18, 2020, 2:53pm

Guild doesn’t save data, that’s technically true — your script saves data. But the intent is 100% that a run contains saved data. That’s really the essence of an experiment. It’s not just the flags and scalars but also any generated file.

When Guild generates a run, it creates a unique directory using a UUID. There’s no chance of overwriting files provided your script saves to a relative path. Guild makes no attempt to jail an operation process. You’re free to save files wherever you want, including outside the run directory. But Guild runs the operation process using the run directory as the current working directory (e.g. os.getcwd()). So if you save a file like this:

open("hello.txt", "w").write("hello!")

you’ll see that file show up in the run directory. You can use guild ls to list these files.

So in this sense, Guild very much “saves data”. Because each run uses a unique directory, you’re perfectly safe in writing whatever you want, again, provided you use a relative path.
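The effect of the working directory on relative paths can be sketched without Guild at all. In the snippet below, two temporary directories stand in for two run directories; the names `run-a`, `run-b`, and `simulated_run` are hypothetical and only illustrate the idea, they are not part of Guild.

```python
import os
import tempfile


def simulated_run(run_dir, message):
    """Write hello.txt using a relative path while `run_dir` is the cwd."""
    previous = os.getcwd()
    os.chdir(run_dir)
    try:
        with open("hello.txt", "w") as f:
            f.write(message)
    finally:
        os.chdir(previous)  # always restore the original working directory


base = tempfile.mkdtemp()
run_a = os.path.join(base, "run-a")  # hypothetical stand-ins for run directories
run_b = os.path.join(base, "run-b")
os.makedirs(run_a)
os.makedirs(run_b)

# The same relative path is written twice, once per simulated run.
simulated_run(run_a, "hello from a")
simulated_run(run_b, "hello from b")

with open(os.path.join(run_a, "hello.txt")) as f:
    content_a = f.read()
with open(os.path.join(run_b, "hello.txt")) as f:
    content_b = f.read()

print(content_a)
print(content_b)
```

Both files survive because the same relative name resolves to a different absolute path in each directory. An absolute path, by contrast, would point both writes at the same file.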

garrett · September 18, 2020, 3:14pm

You’re duplicating what Guild does

If you haven’t already, I suggest that you execute the steps in Get Started. Resist the temptation to just read or skim through the material. Executing the steps is a much better way to learn the ins and outs of Guild.

teracamo · September 18, 2020, 3:39pm

I was actually trying to explain my understanding of how guild works. Glad it sounds like my understanding was pretty accurate

garrett · September 18, 2020, 3:43pm

Ah, I misunderstood this. I thought “it” referred to your script haha. Yes, Guild does this. That’s exactly right.

mislav · September 18, 2020, 3:47pm

Great explanation, thanks. I just don’t understand how the second step in the pipeline imports data generated in the first step when they are in separate directories. What is the path?

garrett · September 18, 2020, 4:00pm

Guild creates links to the resolved files.

This is an important concept. I recommend running this simple example:

# Save this file as guild.yml to an empty directory.
upstream:
  exec: python -c "open('msg.txt','w').write('hello')"

downstream:
  exec: python -c "print(open('msg.txt').read())"
  requires:
    - operation: upstream

To see how this works, first run upstream:

guild run upstream -y

Next run downstream:

guild run downstream -y

How does it work? Take a look at the directory structure for downstream.

guild open -o downstream

This will open a file explorer and show that msg.txt is a link to the file generated by the latest upstream run.
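If it helps to see the mechanism in isolation, the following sketch imitates the result with plain Python and a symbolic link. It is not how Guild is implemented; the directory names are hypothetical, and `os.symlink` here is only a stand-in for whatever Guild does when it resolves the dependency.

```python
import os
import tempfile

base = tempfile.mkdtemp()
upstream_dir = os.path.join(base, "upstream-run")      # hypothetical names
downstream_dir = os.path.join(base, "downstream-run")
os.makedirs(upstream_dir)
os.makedirs(downstream_dir)

# The upstream "run" writes msg.txt into its own directory.
with open(os.path.join(upstream_dir, "msg.txt"), "w") as f:
    f.write("hello")

# Stand-in for dependency resolution: link the file into the downstream directory.
link_path = os.path.join(downstream_dir, "msg.txt")
os.symlink(os.path.join(upstream_dir, "msg.txt"), link_path)

# The downstream "run" opens msg.txt by its local name and reads through the link.
with open(link_path) as f:
    message = f.read()

is_link = os.path.islink(link_path)
print(message, is_link)
```

From the downstream script's point of view the path is simply `msg.txt`, relative to its own directory. On Windows, creating symbolic links may require extra privileges, so this sketch is easiest to try on Linux or macOS.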

Note that pipelines are not involved.

Here’s a more complete example.

mislav · September 21, 2020, 9:16am

@garrett, I understand and use requirements now (even better, I use resources through config).
Your feedback was very useful as always!
