This proposal outlines integrations with DvC to better support data set version control and Guild experiments.
[This is a breaking change.]
This proposal is under development.
DvC is a well-supported and widely adopted tool for managing changes to data sets. As an experiment tracking tool, Guild both generates data sets as output and consumes data sets as operation dependencies. Integration with DvC should help users take advantage of the strengths of both tools.
Guild should support the following use cases:
- Use data sets (files) stored in DvC as experiment dependencies.
- Publish runs or run files to DvC repositories.
- Pull runs stored in DvC repositories.
This proposal consists of two sections (both conceptually raw and under development):
- DvC Dependencies
- DvC Remotes
Support a new `dvc` style dependency. This would use the `dvc` program (which must be installed) to copy files from a DvC repository or URL. It may be necessary to support `dvc-url` if it's not otherwise possible to determine the underlying command from the spec.
In the DvC case, Guild should probably not cache the resolved files as a resource.
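As a rough illustration of how the underlying command might be chosen from a dependency spec, the sketch below maps a spec to either `dvc get` (files tracked in a DvC repository) or `dvc get-url` (a plain URL). The spec keys (`dvc`, `dvc-url`, `path`, `rev`) and the function names are hypothetical and not part of Guild. Only the `dvc get` and `dvc get-url` commands and their `-o` and `--rev` options come from the DvC CLI.

```python
import shutil
import subprocess


def dvc_command(spec, dest):
    """Build the dvc argv for a dependency spec (hypothetical spec format).

    {"dvc-url": "<url>"}                      -> dvc get-url <url> <dest>
    {"dvc": "<repo>", "path": "<file>", ...}  -> dvc get <repo> <file> -o <dest>
    """
    if "dvc-url" in spec:
        # Plain URL: no DvC repository is involved.
        return ["dvc", "get-url", spec["dvc-url"], dest]
    if "path" not in spec:
        raise ValueError("a 'dvc' dependency needs a 'path' within the repository")
    cmd = ["dvc", "get", spec["dvc"], spec["path"], "-o", dest]
    if spec.get("rev"):
        # Optionally pin the Git revision (branch, tag, or commit).
        cmd += ["--rev", spec["rev"]]
    return cmd


def resolve(spec, dest):
    """Copy the dependency into dest by running dvc; no caching is done here."""
    if shutil.which("dvc") is None:
        raise RuntimeError("the dvc program must be installed to resolve this dependency")
    subprocess.run(dvc_command(spec, dest), check=True)
```

Keeping command construction separate from execution makes the mapping easy to test without `dvc` installed. An explicit `dvc-url` key avoids having to guess from the spec whether a URL points at a repository or at a plain file.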
Implement a full-featured DvC remote to let Guild inspect runs, etc., as it does for S3, SSH, and other remote types.
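One possible starting point for run inspection is `dvc list`, which lists the contents of a DvC repository without cloning it. The sketch below assumes, purely for illustration, that each run is tracked as a DvC output under a `runs` directory named by run ID, and that `dvc list` prints one entry per line when its output is captured. The layout and the function names are hypothetical.

```python
import subprocess


def dvc_list_command(repo_url, runs_dir="runs", rev=None):
    """Build the argv that lists DvC-tracked entries under runs_dir."""
    cmd = ["dvc", "list", repo_url, runs_dir, "--dvc-only"]
    if rev:
        cmd += ["--rev", rev]
    return cmd


def parse_run_ids(output):
    """Turn captured `dvc list` output into run IDs (assumes one entry per line)."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_remote_runs(repo_url, runs_dir="runs", rev=None):
    """Return the run IDs found in the repository (requires dvc to be installed)."""
    result = subprocess.run(
        dvc_list_command(repo_url, runs_dir, rev),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_run_ids(result.stdout)
```

A full remote would also need to read run metadata and support push and pull. This covers only the listing step.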