DvC remote support

Summary

In 0.8.0 Guild released preliminary support for DvC. The initial feature set supports:

  • Running DvC stages as Guild operations
  • Defining dependencies on DvC-managed files using either dvcfile or dvcstage sources

Missing from this feature set, however, is the ability to interface with a DvC remote — i.e. pushing and pulling runs as DvC tracked files.

This proposal describes the set of changes to Guild needed to support interaction with DvC remotes.

This proposal is under development

Problem

DvC is a standard way to store data sets and model artifacts used in data-oriented operations. Guild is used to generate such data sets and model artifacts by way of operations. It is currently painful for users to upload Guild-generated files to DvC.

Guild should make this simple.

Proposed Approach

While other approaches exist (see Alternative Approaches), this proposal considers most seriously the option of creating a DvC remote type for Guild. A DvC remote would support the following run-management commands:

ls            List files for a remote run
runs          List runs on a remote
runs info     Show information about a remote run
runs delete   Delete remote runs
runs restore  Restore deleted remote runs on a remote
runs purge    Purge deleted remote runs on a remote
pull          Copy remote runs to the local environment
push          Copy local runs to the remote
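
To make the expected behavior concrete, here is a minimal Python sketch of how the commands above relate to one another, in particular the delete / restore / purge life cycle. InMemoryRemote and all of its method names are hypothetical stand-ins chosen for illustration; this is not Guild’s remote API and it does not talk to DvC. It assumes, as the command descriptions suggest, that delete is a soft delete that restore can undo until purge is run.

```python
class InMemoryRemote:
    """Toy model of a run remote (hypothetical; not Guild's actual API)."""

    def __init__(self):
        self._runs = {}     # run id -> {relative path: file content}
        self._deleted = {}  # soft-deleted runs, same shape as _runs

    def push(self, run_id, files):
        """Copy a local run (given as a path -> content mapping) to the remote."""
        self._runs[run_id] = dict(files)

    def pull(self, run_id):
        """Return a copy of a remote run's files for the local environment."""
        return dict(self._runs[run_id])

    def runs(self, deleted=False):
        """List run ids on the remote, or the soft-deleted ones if requested."""
        return sorted(self._deleted if deleted else self._runs)

    def info(self, run_id):
        """Show basic information about a remote run."""
        return {"id": run_id, "file_count": len(self._runs[run_id])}

    def ls(self, run_id):
        """List files for a remote run."""
        return sorted(self._runs[run_id])

    def delete(self, run_id):
        """Soft-delete a run (assumption: it stays restorable until purged)."""
        self._deleted[run_id] = self._runs.pop(run_id)

    def restore(self, run_id):
        """Bring a soft-deleted run back."""
        self._runs[run_id] = self._deleted.pop(run_id)

    def purge(self):
        """Permanently remove all soft-deleted runs."""
        self._deleted.clear()
```

A DvC-backed remote would need to provide the same operations on top of DvC’s content store, which is where most of the design work in this proposal lies.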

Use Case: Publish Guild runs for use as remote DvC dependency

This use case involves a run publisher and a DvC file consumer.

Run publisher

A run publisher generates a Guild run and publishes it to a DvC remote.

  1. Configure a DvC remote in user config (optional)
  2. Generate a run with Guild:

       guild run <op>

  3. Push the run to a DvC remote:

       guild push my-dvc-remote

     or:

       guild push dvc:<dvc remote info>

This copies the run to the specified DvC remote, where it can be queried using either DvC commands or Guild commands. For example:

guild runs list --remote my-dvc-remote
guild ls --remote my-dvc-remote

DvC file consumer

A DvC file consumer uses one or more Guild-published artifacts as dependencies in a DvC stage.

The consumer configures the stage using the applicable DvC URL/reference to the Guild-published artifact(s).

TODO: Need a yaml example here

The user runs the stage using DvC.

Alternative Approaches

The sections below describe alternative approaches.

Use guild publish with DvC smarts

Guild’s remote support treats run directories as black boxes: what’s pushed is exactly what Guild needs in order to store a run. While Guild stores runs in a straightforward way, its file structure might not be what a DvC user wants.

The publish command might be used as an alternative to guild push to copy specific files for a run to a DvC remote.

To simplify the user experience, Guild could provide a DvC-aware template (or other built-in logic) to copy certain run files (e.g. only those generated by the run).

Note that guild publish is intended to create a user-facing artifact (e.g. a directory with a website, files to push to GitHub, etc.) and guild push is used to replicate local runs in a remote location.

Having spent some time on a prototype for DvC remote support, I am having second thoughts about this feature as described (i.e. storage-oriented DvC remote support).

DvC remote support is a straightforward interface for copying to and from DvC’s content store. It’s not a general-purpose file store, although it uses general-purpose file stores underneath.

The functionality outlined in this proposal can be achieved today using standard Guild remotes and then referencing those remote files from DvC using the applicable URLs.

It’s not clear that having this functionality would provide much value.

There might be some fruit in extending guild publish with some DvC support. Without further input I’d be speculating though.

@garrett

The one use case we have is the following:

We have a training repository (all Python) and an inference repository (C++, robotics, Python).

We train our models in training using guild, pytorch and the whole Python stack. In inference we need to load the model from training, convert it to a binary executable and use it in C++.

The reason we like DvC for this is that we can always check in the latest and greatest model with git and dvc in the training repo and then use dvc import in inference.

Doing it with git + dvc aligns nicely with a standard git workflow. I guess using guild and native guild remote, we could keep track of the guild run IDs, which wouldn’t be too big of a deal, I guess.

Ah, very helpful!

How do you do this today? Do you copy the files from the Guild run dir(s) into the project directory so you can save it using DvC?

How do you determine “latest and greatest”? Is it just the latest trained model, or do you perform some sort of analysis and select a model to store (for inference) based on performance criteria?

Currently we simply copy from the guild run dirs and then check the model file in to training using dvc, but this way we lose all the guild run info that would be nice to track as well.

We will only check in the latest and greatest AFTER we have performed analysis (using guild). Then we create a PR, run some builds with the new model (as part of our CI) and update dashboards with our progress over time/commits. These dashboards and builds also use guild’s feature set.

Then in inference we just reference the model file in the main branch in training since we know it has gone through a rigorous PR review in training. If we need an older version, we just reference an old commit in training.

To add to this, I don’t necessarily think DvC is needed for this. The reason we use it now is that we need this PR review process for new models that are being trained (this is a required process in our regulatory environment). Using guild only, since it is kind of decoupled from git, can make this a little tricky, or at least I haven’t found a good way of doing it yet.

So the PR process here is git based and the review looks over source code? Do you include any run-generated artifacts (e.g. flags, scalars, generated plots, etc.) in the review? Or is this just source code folks are looking at?

And the PR I assume is then used to merge the latest/greatest code into an upstream tracking branch in git?

We want to include all the flags, scalars etc. in the review, but we currently don’t.

Yes, correct with the reference to the new model checked in to DvC.

Is the commit to source code and DvC in your project directory something that’s automatic, after some analysis is performed to select the best run? E.g. what happens when your local/working project source code is different from the source code used in the selected/best run? Do you study the differences and hand-merge, or is there an automatic replacement of local changes?

It’s very much the use case that CML supports.

We don’t have anything that enforces that right now and that can be an issue. That is also why I proposed this issue in github.

There’s nothing, though, keeping you from modifying your source code after the run is started (this is a feature IMO in that you can tee up new experiments without worrying about corrupting your runs-in-progress or staged runs with local code changes).

I don’t see the workflow you’re describing in the CML docs.

What do you think of something like this:

  1. Start one or more runs with Guild (this could be a single run, many runs using hparam optimizations, or a series of experiments conducted over hours or days - doesn’t matter). During these experiments you can modify your code without worrying about commits to git.

  2. Evaluate the results (using Guild or something else) to find a run that you’re interested in submitting as a PR.

  3. Run a command in Guild (this is something new we’d create - e.g. guild merge or something like that) that would cause your project/source code directory to become in sync with the Guild run directory. Source and various run artifacts would be copied from that run directory to your project directory. The list of files copied could be driven by git (for source code) and DvC (for non-source files). The command would not automatically commit anything - it would leave your working directory ready for commits by you.

  4. From your project directory, modified by Guild to reflect the state of the selected run, you would commit the changes using git, DvC, whatever.

Guild could provide some safeguards to ensure that you don’t lose local changes (e.g. by making backups somewhere so you could revert a merge, or by verifying that each file copied is already committed in git). Or it could insist that files be committed to git and up-to-date in DvC.
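
As a rough illustration of step 3 and the backup safeguard, here’s a minimal Python sketch. Everything in it is an assumption made for illustration: merge_run, its parameters, and the backup layout are hypothetical names, not an existing Guild command or API. The list of files to copy is simply passed in by the caller; deriving it from git and DvC is left out.

```python
import shutil
from pathlib import Path


def merge_run(run_dir, project_dir, files, backup_dir):
    """Copy the listed files from a run directory into a project directory.

    Hypothetical helper: any project file that would be overwritten is first
    copied to backup_dir so the merge can be reverted. Nothing is committed.
    Returns the relative paths that were actually copied.
    """
    run_dir = Path(run_dir)
    project_dir = Path(project_dir)
    backup_dir = Path(backup_dir)
    copied = []
    for rel in files:
        src = run_dir / rel
        dest = project_dir / rel
        if not src.is_file():
            # The listed file does not exist in this run; leave the project as-is.
            continue
        if dest.exists():
            # Safeguard: keep the current project version before overwriting it.
            backup = backup_dir / rel
            backup.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(dest, backup)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(rel)
    return copied
```

After a call like this, the project directory reflects the selected run for the listed files, and the commit itself (step 4) is still up to you.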

This would let you keep your project in sync easily with a particular run and then do whatever you want with that (PRs, etc.)

This workflow is unlike the DvC workflow, which writes everything to your local directory by design (this causes problems with consistency, especially in cases of concurrent operations).

You can freely experiment by changing code, running, changing code, etc. without worrying about commits, synchronization with other systems, etc.

Each run is its own system of record for that run - what is in that run directory is exactly what ran, no question.

The decision to bring over run state into a project is explicit (not automatic) and something that you the human control. You’re responsible for the commands needed to get the changes into git, DvC, or any other system. Guild is just helping make your project state consistent with a run. This includes source code, metadata (flags, generated config, scalars, analytics like plots and other summaries), and of course trained models.

@copah apologies for the long post above! This is a mini-proposal for a new feature that I think might provide a helpful workflow to you and your team. We’re prepared to start work on this next week, if you like the sound of things. If you can spot any issues, let us know and we can tweak the approach as needed.

Thanks again for your input!

@garrett

Thank you very much for this detailed proposal. Sorry for the late reply; as mentioned, I just completed a coast-to-coast move.

I think the proposal looks great and would definitely work with our workflow. Let me know if you need more input from me than this.

I created an RFC for this: Guild merge