Summary
In 0.8.0 Guild released preliminary support for DvC. The initial feature set supports:
- Run DvC stages as Guild operations
- Define dependencies on DvC managed files using either dvcfile or dvcstage sources
Missing from this feature set, however, is the ability to interface with a DvC remote — i.e. pushing and pulling runs as DvC tracked files.
This proposal describes the set of changes to Guild needed to support interaction with DvC remotes.
This proposal is under development.
Problem
DvC is a standard way to store data sets and model artifacts used in data-oriented operations. Guild is used to generate such data sets and model artifacts by way of operations. It is currently painful for users to upload Guild-generated files to DvC.
Guild should make this simple.
Proposed Approach
While there are other approaches, this proposal considers most seriously the option of creating a DvC-backed Guild remote. Such a remote would support the following run-management commands:
| Command | Description |
|---|---|
| ls | List files for a remote run |
| runs | List runs on a remote |
| runs info | Show information about a remote run |
| runs delete | Delete remote runs |
| runs restore | Restore deleted remote runs on a remote |
| runs purge | Purge deleted remote runs on a remote |
| pull | Copy remote runs to the local environment |
| push | Copy local runs to the remote |
Use Case: Publish Guild runs for use as remote DvC dependency
This use case involves a run publisher and a DvC file consumer.
Run publisher
A run publisher generates a Guild run and publishes it to a DvC remote.
- Configure a DvC remote in user config (optional)
- Generate a run with Guild
guild run <op>
- Push the run to a DvC remote
guild push my-dvc-remote
or:
guild push dvc:<dvc remote info>
This copies the run to the specified DvC remote, where it can be queried using either DvC commands or Guild commands. For example:
guild runs list --remote my-dvc-remote
guild ls --remote my-dvc-remote
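The optional user-config step above could be sketched as follows. This is only an illustration: Guild defines remotes under a `remotes` section in user config, but the `dvc` remote type and the `url` attribute shown here are assumptions, since this proposal does not yet specify them. The bucket path is a placeholder.

```yaml
# Guild user config (e.g. ~/.guild/config.yml)
remotes:
  my-dvc-remote:
    # Hypothetical remote type proposed by this document; not an existing Guild type
    type: dvc
    # Hypothetical attribute naming the DvC remote storage location (placeholder value)
    url: s3://my-bucket/guild-runs
```

With a named remote like this, the publisher would use `guild push my-dvc-remote` rather than spelling out the remote details inline.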
DvC file consumer
A DvC file consumer uses one or more Guild-published artifacts as dependencies in a DvC stage.
The consumer configures the stage using the applicable DvC URL/reference to the Guild-published artifact(s).
TODO: Need a yaml example here
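Pending the author's example, one possible shape for such a stage is sketched below, using DvC's support for external dependencies. Everything specific here is an assumption: the stage name, the bucket, the `RUN_ID` placeholder, the file name, and above all the layout of a pushed run on the remote, which this proposal has not defined.

```yaml
# dvc.yaml (illustrative sketch only)
stages:
  fetch-model:
    # Copy the Guild-published file into the workspace.
    # The S3 path and RUN_ID placeholder are hypothetical.
    cmd: aws s3 cp s3://my-bucket/guild-runs/RUN_ID/model.pkl model.pkl
    deps:
      # External dependency: DvC re-runs the stage when this remote file changes
      - s3://my-bucket/guild-runs/RUN_ID/model.pkl
    outs:
      - model.pkl
```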
The user runs the stage using DvC.
Alternative Approaches
The sections below describe alternative approaches.
Use guild publish with DvC smarts
Guild’s remote support treats run directories as black boxes: what’s pushed is exactly what Guild needs in order to store a run. While Guild stores runs in a straightforward way, its file structure might not be what a DvC user wants.
The publish command might be used as an alternative to guild push to copy specific files for a run to a DvC remote.
To simplify the user experience, Guild could provide a DvC-aware template (or other built-in logic) to copy only certain run files (e.g. only those generated by the run).
Note that guild publish is intended to create a user-facing artifact (e.g. a directory with a website, files to push to GitHub, etc.), whereas guild push is used to replicate local runs in a remote location.