DvC support

Summary

This proposal outlines integrations between Guild and DvC to better support data set version control for Guild experiments.
[This is a breaking change.]

This proposal is under development.

Problem

DvC is a well-supported and widely adopted tool for managing changes to data sets. As an experiment tracking tool, Guild both generates data sets as output and consumes data sets as operation dependencies. Integrating with DvC should let users take advantage of the strengths of both tools.

Proposed Approach

Guild should support the following use cases:

  • Use data sets (files) stored in DvC as experiment dependencies.
  • Publish runs or run files to DvC repositories.
  • Pull runs stored in DvC repositories.

This proposal consists of two sections (both conceptually raw and under development):

  • DvC Dependencies
  • DvC Remotes

DvC Dependencies

Support a new dvc-style dependency. Guild would use the dvc program, which must be installed for this dependency type to work, to copy files from a DvC repository or URL. It may be necessary to support two dependency types, dvc and dvc-url, if the underlying command cannot otherwise be determined from the spec.
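
As a rough illustration of why two dependency types might be needed, here is a minimal Python sketch that maps a dependency type to a dvc command line. It assumes the dvc type corresponds to `dvc get` (copy a path out of a DvC repository) and the dvc-url type to `dvc get-url` (copy from a plain URL). The function names, the parameters, and that mapping are hypothetical and not part of Guild.

```python
import shutil


def dvc_available():
    """Return True if a dvc executable is found on PATH."""
    return shutil.which("dvc") is not None


def dvc_command(dep_type, source, path=None, dest="."):
    """Build the dvc argv for a hypothetical dependency spec.

    dep_type: "dvc" (source is a DvC repository) or "dvc-url"
              (source is a plain URL). These names are assumptions.
    source:   repository location or URL
    path:     path within the repository (required for "dvc")
    dest:     where to write the copied files
    """
    if dep_type == "dvc":
        if path is None:
            raise ValueError("a 'dvc' dependency needs a path in the repository")
        # dvc get <repo> <path> -o <dest>
        return ["dvc", "get", source, path, "-o", dest]
    if dep_type == "dvc-url":
        # dvc get-url <url> <dest>
        return ["dvc", "get-url", source, dest]
    raise ValueError(f"unsupported dependency type: {dep_type!r}")
```

The returned list could be passed to `subprocess.run` after confirming `dvc_available()`. Keeping the type explicit avoids having to guess from the source string alone whether it points to a DvC repository or an ordinary URL.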

I think Guild should not cache DvC dependencies as resources.

DvC Remotes

Implement a full-featured DvC remote to let Guild inspect runs, etc., as it does for S3, SSH, and other remote types.

Alternative Approaches

Pending
