If you drop the requires for the model ops, you won’t be able to get access to any of the prepared data files. I think you need to have that.
If there’s a case where your model doesn’t need data files, that might be a separate operation. For example, let’s say your model can load and check a network graph for correctness. It doesn’t need any data for that. In that case, I’d create a separate operation, e.g. load-and-validate-graph.
If you have a case where your model op runs fake/sample data, create an upstream operation that provides the prepared fake/sample data. You can do that this way:
    prepare-data: ...
    prepare-sample-data: ...
    train:
      requires:
        - operation: prepare-data|prepare-sample-data
Regarding steps, yes, you’d have to do the extremely annoying work of copying all of those flags and maintaining them. I realize that’s a pain and Guild will fix this in an upcoming release. Until then, consider exposing only the subset of flags that you’re likely to change for the pipeline op.
Regarding flag sharing, I updated the example project to, I think, do what you want to do.
Take a look at TESTS.md. This shows the behavior that I think you’re after.
You can run the tests:
guild check -nt TESTS.md
The tests use command substitution, which might not work if you’re running a non-POSIX shell (e.g. Windows command prompt).
There are two main questions answered by this example:
- How do you share data across operations?
- How do you show a value-of-interest of an upstream run in the corresponding downstream run?
Treat these topics separately.
There’s only one correct answer to question 1. You share data across operations through files. That’s it. An upstream operation saves files that the downstream operation reads. You connect these files in Guild using dependencies. If you weren’t using Guild, you would connect these files by running scripts that use common directories or otherwise point to shared files.
It doesn’t matter what encoding scheme you use as long as upstream and downstream operations agree. In the example project, I picked JSON. The prepare-data operation saves data set metadata in meta.json. The model ops load that metadata. You can use any scheme you want. Guild doesn’t care.
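To make the file-based handoff concrete, here is a rough sketch of what the two sides could look like in Python. This is not the code from the example project. The function names and the metadata keys (dataset, rows) are made up for illustration, and a temp directory stands in for wherever the upstream files end up visible to the downstream op.

```python
import json
import tempfile
from pathlib import Path

META_FILE = "meta.json"  # file name used in the example project


def save_meta(run_dir, meta):
    """Upstream side (e.g. prepare-data): write data set metadata as JSON."""
    path = Path(run_dir) / META_FILE
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return path


def load_meta(run_dir):
    """Downstream side (a model op): read the metadata saved upstream."""
    return json.loads((Path(run_dir) / META_FILE).read_text())


if __name__ == "__main__":
    # Assumption: a temp directory plays the role of the shared location.
    with tempfile.TemporaryDirectory() as shared_dir:
        # Hypothetical metadata keys, chosen only for this sketch.
        save_meta(shared_dir, {"dataset": "sample", "rows": 100})
        meta = load_meta(shared_dir)
        print(meta["dataset"], meta["rows"])
```

The only contract between the two operations is the file name and the JSON layout, so you could swap JSON for any other encoding as long as both sides change together.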
The remaining question is number 2. When comparing runs, how do you show important values from an upstream run in a downstream run? Guild doesn’t have a good answer for this. It will in an upcoming release.
In the meantime, I hacked flags to show how you can do this today.
Please note that this is a hack — meaning we’re using flags for a purpose they are not intended for. It’s a safe hack, which is why I’m presenting it as an option here.
Technically speaking, the metadata associated with the prepared data are not flags, and they are certainly not model operation flags. I would describe these as attributes of the upstream run. They may be derived from upstream flag values, but they could also be generated. Whatever they are, they’re part of the generated upstream artifacts. The problem is that we want to see what those are whenever we look at one of the downstream runs (model ops) that use these upstream artifacts.
That’s going to take a bit of engineering to get right. But that’s another topic. This is related to this discussion.
In any case, the flag hack is relatively simple and I think gets you your flag sharing scenario while maintaining a correct data interface across runs.
Let me know what you think!