Summary
This proposal outlines integrations between Guild and DvC to better support data set version control alongside Guild experiments.
[This is a breaking change.]
This proposal is under development.
Problem
DvC is a well-supported, widely adopted tool for managing changes to data sets. As an experiment tracking tool, Guild both generates data sets as output and consumes data sets as operation dependencies. Integration with DvC should help users take advantage of the strengths of both tools.
Proposed Approach
Guild should support the following use cases:
- Use data sets (files) stored in DvC as experiment dependencies.
- Publish runs or run files to DvC repositories.
- Pull runs stored in DvC repositories.
This proposal consists of two sections (both conceptually raw and under development):
- DvC Dependencies
- DvC Remotes
DvC Dependencies
Support a new dvc-style dependency. This would use the dvc program (which must be installed for the dependency to resolve) to copy files from a DvC repository or URL. It may be necessary to support separate dvc and dvc-url dependency types if it’s not otherwise possible to determine the underlying command from the spec.
DvC dependencies should probably not be cached as resources, though this is still tentative.
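As a rough illustration of how a dependency spec could map to an underlying dvc command, here is a minimal Python sketch. It assumes two hypothetical dependency types, dvc and dvc-url, mapped to the existing `dvc get` and `dvc get-url` commands respectively. The function names, spec fields, and example URLs are illustrative assumptions only and are not part of Guild or of this proposal's final design.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional


def dvc_command(kind: str, url: str, path: Optional[str] = None,
                out: str = ".", rev: Optional[str] = None) -> List[str]:
    """Build the dvc command line for a hypothetical dependency spec.

    kind "dvc" is assumed to map to `dvc get` (a file tracked in a DvC
    repository); kind "dvc-url" is assumed to map to `dvc get-url`
    (a file at a plain URL).
    """
    if kind == "dvc":
        if path is None:
            raise ValueError("a 'dvc' dependency needs a path inside the repository")
        cmd = ["dvc", "get", url, path, "-o", out]
        if rev is not None:
            # Pin the dependency to a Git revision of the DvC repository.
            cmd += ["--rev", rev]
        return cmd
    if kind == "dvc-url":
        return ["dvc", "get-url", url, out]
    raise ValueError(f"unsupported dependency type: {kind!r}")


def resolve(kind: str, url: str, **spec) -> None:
    """Copy the dependency into place by running dvc (hypothetical helper)."""
    if shutil.which("dvc") is None:
        raise RuntimeError("the dvc program must be installed to resolve DvC dependencies")
    subprocess.run(dvc_command(kind, url, **spec), check=True)


if __name__ == "__main__":
    # Print the commands only; nothing is downloaded here.
    print(shlex.join(dvc_command("dvc", "https://example.com/repo.git",
                                 "data/train.csv", out="run_dir", rev="v1")))
    print(shlex.join(dvc_command("dvc-url", "https://example.com/data.csv",
                                 out="run_dir")))
```

Keeping command construction separate from execution makes the mapping from spec to command easy to inspect. It also shows why two dependency types may be needed: `dvc get` requires both a repository and a path within it, while `dvc get-url` takes a single URL, so the command cannot always be inferred from a URL alone.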
DvC Remotes
Implement a full-featured DvC remote to let Guild inspect runs, etc., as it does for S3, SSH, and other remote types.
Alternative Approaches
Pending