In 0.8.0 Guild released preliminary support for DvC. The initial feature set supports:
- Run DvC stages as Guild operations
- Define dependencies on DvC-managed files using either dvcfile or dvcstage sources
This proposal describes the set of changes to Guild needed to support interaction with DvC remotes.
This proposal is under development.
DvC is a standard way to store data sets and model artifacts used in data-oriented operations. Guild is used to generate such data sets and model artifacts by way of operations. It is currently painful for users to upload Guild-generated files to DvC.
Guild should make this simple.
While there are several possible approaches, this proposal considers most seriously the option of creating a DvC Guild remote. A DvC remote would support the following run-management commands:
- List files for a remote run
- List runs on a remote
- Show information about a remote run
- Delete remote runs
- Restore deleted runs on a remote
- Purge deleted runs on a remote
- Copy remote runs to the local environment
- Copy local runs to the remote
This use case involves a run publisher and a DvC file consumer.
A run publisher generates a Guild run and publishes it to a DvC remote.
- Configure a DvC remote in user config (optional)
- Generate a run with Guild
guild run <op>
- Push the run to a DvC remote
guild push my-dvc-remote
guild push dvc:<dvc remote info>
This copies the run to the specified DvC remote, where it can be queried using either DvC commands or Guild commands. For example:
guild runs list --remote my-dvc-remote
guild ls --remote my-dvc-remote
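The optional configuration step in the publisher workflow above might look like the following sketch in Guild's user config (typically ~/.guild/config.yml). The remotes section and the type attribute follow Guild's existing remote configuration. The dvc type value and the dvc-remote attribute are assumptions made for illustration only, since this proposal does not yet define them.

```yaml
# ~/.guild/config.yml
remotes:
  my-dvc-remote:
    # Hypothetical remote type; not an existing Guild remote type
    type: dvc
    # Hypothetical attribute identifying the DvC remote to push runs to
    dvc-remote: <dvc remote info>
```

With an entry along these lines, the name my-dvc-remote could be used with guild push and with the --remote option in place of the inline dvc:<dvc remote info> form.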
A DvC file consumer uses one or more Guild-published artifacts as dependencies in a DvC stage.
The consumer configures the stage using the applicable DvC URL/reference to the Guild-published artifact(s).
TODO: Need a yaml example here
The user runs the stage using DvC.
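A consumer stage might be sketched as the dvc.yaml fragment below. It relies on DvC's support for external dependencies, where a stage dependency is a URL rather than a local path. The stage name, bucket, and file names are hypothetical, and RUN_ID stands in for a Guild run ID. The layout of a pushed run on the remote is not defined by this proposal, so the URL is illustrative only.

```yaml
stages:
  fetch-model:
    # Copy the Guild-published file into the DvC project (hypothetical location)
    cmd: aws s3 cp s3://my-bucket/guild-runs/RUN_ID/model.pkl model.pkl
    deps:
      # External dependency: the stage is re-run when this remote file changes
      - s3://my-bucket/guild-runs/RUN_ID/model.pkl
    outs:
      - model.pkl
```

Under these assumptions, the consumer would run the stage with dvc repro.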
The sections below describe alternative approaches.
Guild’s remote support treats run directories as black boxes: what’s pushed is exactly what Guild needs in order to store a run. While Guild stores runs in a straightforward way, its file structure might not be what a DvC user wants.
The publish command might be used as an alternative to guild push to copy specific files for a run to a DvC remote.
To simplify the user experience, Guild could provide a DvC-aware template (or other built-in logic) to copy certain run files (e.g. only those generated by the run).
guild publish is intended to create a user-facing artifact (e.g. a directory with a website or files to push to GitHub), whereas guild push is used to replicate local runs in a remote location.