Data versioning

Hi,
I’m getting really curious about your tool. One question that I have is how do you deal with data versioning in the Guild.ai framework? This is a key part of reproducibility. Is there a way to do it (didn’t see in the doc), or are you thinking of integrating with open-source tools like DVC?
Thanks!
Manu

Hello and welcome!

There are two mechanisms in Guild that address data versioning:

  • File resources
  • Runs

A file resource represents a file input to an operation. You express these in a Guild file
through an operation dependency.

train:
  requires:
    - file: data/train.tar.gz
      sha256: abc123

These can also be URLs:

train:
  requires:
    - url: https://my.org/data/train.tar.gz
      sha256: abc123

When Guild runs train in either case, it resolves the dependencies by locating them and verifying their SHA digests. It then creates links to the applicable files in the run directory.
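For example, with the file dependency above, the flow looks roughly like this (a sketch; output omitted):

guild run train    # verifies data/train.tar.gz against its sha256 and links it into the run directory
guild ls           # lists the files in the latest run, including the linked data file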

You can also define operation dependencies. This is where one operation requires the output from another operation. We refer to these as downstream and upstream operations, respectively.

Consider a case where a prepare-data operation reads records from a database and saves them as files for use by a training operation.

train:
  requires:
  - operation: prepare-data

prepare-data:
  flags:
    db-user: ml-user
    db-pwd: null

Here you need to run prepare-data before you run train. The output from prepare-data — e.g. the data records — is set up as links for train, just as with the file dependencies above.
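As a rough sketch of that workflow:

# Run the upstream operation first
guild run prepare-data db-user=ml-user

# Then run the downstream operation. By default Guild resolves the
# prepare-data dependency using the latest prepare-data run.
guild run train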

While prepare-data might not be recreatable due to a changing database, a train operation is recreatable given the same prepare-data run.

In all of these scenarios, Guild supports strict reproducibility.

Versioning data files is outside of Guild’s scope in the same way versioning source code is outside Guild’s scope. Guild captures everything used in an experiment so there’s full auditability. You get strict reproducibility by paying attention to upstream dependencies. But there’s no git-like versioning scheme involved.

I’m curious if some Guild users have also used DVC and what their thoughts are.

1 Like

Thanks for the detailed answer! The SHA checks and operation dependencies make sense, but they don’t totally address the issue, though I understand you want to control the scope of this tool.
I’d be curious if people have integrated data versioning within their Guild workflow.

Follow up question: since Guild is not versioning code, what is a good workflow when developing a model? Every time a run happens, the code is saved in the runs directory. So if I’m happier with a previous code version, should I export the saved run code to the repo and commit? Or do you recommend other workflows with git?

Thanks!

You can run guild diff --working for the run in question to pull up your diff tool and see how your local copy differs. If you want to merge changes back into your project, you can take that route.

You can also take the time to commit before running. Guild saves the Git commit version with each run so you can use git to check out that commit.
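Roughly, the two options look like this (a sketch; adapt as needed):

# Option 1: compare a run's saved source code against your working copy
guild diff --working

# Option 2: commit first, then run. Guild records the commit with the run,
# so you can check that commit out with git later.
git commit -am "tweak model"
guild run train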

But to require a commit just to run an experiment is incredibly tedious IMO. I can’t imagine working like that.

I wonder what Guild is missing here?

Oh cool, I missed the fact that Guild saves the current commit, that’s helpful. I definitely love the fact that you don’t need to commit to run things though, that’s pretty flexible.
Maybe if there were a guild command that took the code from a run (maybe the best run, or any run picked by the user) and prepared a git commit in the repo that is currently in use, that would be an interesting workflow? Something like guild git-commit --run XYZ
I’ve quickly seen (but not looked at in detail) that you can build plugins for Guild; maybe this command could be a plugin.

That’s an interesting idea! At that point I wonder if proper DVC integration is in order, as then the entire run could be versioned in that system. As it is, Guild is like the olden days when copies of things were stored in directories and you diffed them using various tools. The VCS revolution moved all of that history into a single directory with revisions/commits.

I could see perhaps DVC being essentially the backend “file system” for runs for a given project.

Have you used DVC?

No, but I’ve started looking into it too. At the moment, you can track parameters and output metrics but one run at a time (i.e., you need to commit every time). But they have an interesting discussion going on about handling parameter exploration and corresponding results more efficiently. See here

Terrific thread — thanks for the link!

That highlights the differences between Guild’s and DVC’s approaches. DVC is very git-centered — the entire discussion centers on how to use git facilities to track experiments.

The opening set of requirements is, I think, worth repeating here:

High-level requirements to the hyperparameters tunning stage:

  1. Run. Run dozens of experiments without committing any results into Git while keeping track of all the experiments. Each of the experiments includes a small config change or code change (usually, 1-2 lines).
  2. Compare. A user should be able to compare two experiments: see diffs for code (and probably metrics)
  3. Visualize. A user should be able to see all the experiments results: metrics that were generated. It might be some table with metrics or a graph. CSV table needs to be supported for custom visualization.
  4. Propagate. Choose “the best” experiment (not necessarily the highest metrics) and propagate it to the workspace (bring all the config and code changes. Important: without retraining). Then it can be committed to Git. This is the final result of the current hyperparameter tunning stage. After that, the user can continue to work with a project in a regular Git workflow.
  5. Store. Some (or all) of the experiments might be still useful (in additional to “the best” one). A user should be able to commit them to the Git as well. Preferably in a single commit to keep the Git history clean.
  6. Clean. Not useful experiments should be removed with all the code and data artifacts that were created. A special subcommand of dvc gc might be needed.
  7. [*] Parallel. In some cases, the experiments can be run in parallel which aligns with DVC parallel execution plans: #2212, #755. This might not be implemented now (in the 1st version of this feature) but it is important to support parallel execution by this new lightweight abstraction.
  8. Group. Iterations of hyperparameters tuning might be not related to each other and need to be managed and visualized separately. Experiments need to be grouped somehow.

A lot of the challenges with these features (see the original discussion) apply only when you assume that everything is stored in git. When you relax that assumption, the problems become simpler.

Guild separates the concerns of experiment management from long-term artifact storage. While Guild stores runs in a specific way, it does so without any opaque facilities. Everything is stored on normal file systems in plain view (if you care to look).

If I were to make a comparison to git, Guild’s workflow corresponds to git’s “working tree” or “workspace”. This is where you’re free to change things, run your code, see what happens, etc. In git, you commit only when you’re ready to commit — no one forces you to commit your changes every time you run your program. How incredibly frustrating would that be?

In Guild, you run experiments with zero ceremony. In fact, any time you run your script, that run is a measured experiment. So you can run dozens and dozens of experiments, very casually, without thinking about it.
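For example (a minimal sketch; the flag names here are just whatever your script happens to define):

# Each of these is captured as a run, with no project changes and no commits
guild run train.py lr=0.01
guild run train.py lr=0.001

# Compare results across runs whenever you like
guild compare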

With Guild you always measure.

This is hard to appreciate until you experience the “ah ha” of going back to a failed run — something you never considered to be valuable — and finding the answer to a question. If you don’t make measurement so trivial that you don’t even think about it, you won’t experience this.

If and when you want to save something for posterity, then you might consider saving to git. I think that’s where Guild and DVC could connect. From what I’m seeing, folks are using DVC as “git with support for storing large files”. Guild could support DVC as a “remote type” for push/pull commands.

Personally I think this goes against the grain of git. It’s not the use case that git is designed for. Normal file systems are well suited for large binary storage. There’s nothing git offers in that case that can’t be accomplished using traditional tools like diff and file system links. In fact, as that thread shows, using git makes things harder.

This is not to say anything derogatory about DVC! I want to highlight the differences in approaches and why Guild can handle “reproducibility” perfectly well without using git. There’s some more work for Guild on this front, but nothing that git is going to help with.

1 Like

I want to bump this up. Using DVC with guild seems like a great idea:

  • tracking code for feature engineering and such in git is a wonderful idea, and DVC is good at that
  • DVC fails miserably when it comes to experiment tracking

DVC is extremely cumbersome when it comes to running different experiments: since it’s git-based, every hyperparameter change has to be versioned in files.
The ease of doing this is light years behind guild.
For example, DVC by design has problems with command line arguments. Basically, you can’t just pass hyperparameters as arguments to your script. To make matters worse, you even have to hardcode the config file that is used in the script.
Also, you have to specify all the hyperparameters you use in the dvc run command. It took me a couple of hours to figure out what was happening; from a user perspective, the fact that you have to declare parameters but then never actually pass them anywhere confused the hell out of me.
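For context, here is roughly what I mean (from memory, so the details may be off):

# You declare the params DVC should track for the stage...
dvc run -n train -p lr,epochs -d train.py -o model.pkl python train.py

# ...but the values live in params.yaml, and train.py still has to open and
# parse that file itself. Nothing is actually passed to the script.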

I think it’s a very bad idea because then you have to keep track of not one but two files for the hyperparameters of a single experiment.
The issue that tracks workarounds also doesn’t point toward a good approach; it seems like pretty arcane stuff happening in YAML files.

1 Like

Just wanted to say I have the exact same experience. DVC does data and code tracking well - experiment tracking is a pain.

This is great input - thank you for that. I’m bumping the priority of DVC integration for 0.8. I think it makes the most sense to make it easy to grab files from a DVC repo as dependencies. Something like this:

# guild.yml

op:
  requires:
    - dvc: <some spec to point to file(s) in DVC>

The idea here is to let you store officially versioned artifacts in DVC, as you would with any file repository. This scheme just makes it easier to access those files for an operation.
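To make that concrete, a purely hypothetical example (the path and syntax here are made up for illustration; the actual spec format is not settled):

# guild.yml - hypothetical syntax
op:
  requires:
    - dvc: data/train.tar.gz   # a file tracked in the project's DVC repo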

I think also Guild could support DVC as a remote type.

~/.guild/config.yml

remotes:
  dvc:
    type: dvc
    # connect / auth attrs

Then, to store Guild runs in DVC:

guild push dvc

And similar for pull, etc.

This remote support would be storage only — the run command would not be supported for this remote type. This follows the limitation of the s3 remote type.

Any thoughts on this? Anything else to add?

1 Like

I think you are on point with storage only. The way we currently use dvc is as data pointers to external storage, which are then tracked by git. The rest of dvc functionality, like dvc repro and dvc exp, is just too cumbersome and is light years behind guild.

Your suggestion to use requires with a dvc-specific attribute is spot-on. At least for our use case.

So instead of doing this:

dvc pull data.dvc
guild run train  # This uses the pulled data file from dvc

You could just do

guild run train

and guild would automatically resolve the data.dvc dependencies by doing dvc pull behind the scenes.

1 Like

Keep an eye out—this is coming!

3 Likes

A pre-release of 0.8 is now available with preliminary DvC support!

To install the pre-release, run:

pip install guildai --pre

Please feel free to comment here with questions or feedback. We can use GitHub issues to track resolution of specific problems as they come up.

Thanks to everyone who’s expressed interest in this functionality! I’m looking forward to hearing your feedback so we can further improve this feature set!

This is exciting! Will try it out asap.

A couple of immediate questions:

  • When we use DvC we like to check in each individual raw data file. These can be different video recordings, which means we’ll have a folder structure like:
data/raw/
├── Video1.mp4.dvc
├── Video2.mp4.dvc
├── .....
├── Video100.mp4.dvc
  • I think having the dvcfile source type accept glob expansions or similar would be beneficial in this case (see the hypothetical sketch after this list).
  • Any plans on adding support for easily checking in guild runs to DvC?
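Purely as a hypothetical illustration of the glob idea (made-up syntax, nothing that exists today):

# guild.yml - assuming dvcfile accepted glob patterns
train:
  requires:
    - dvcfile: data/raw/Video*.mp4   # resolve every video tracked by the corresponding .dvc files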

I’ll keep you posted on how it works when I start playing around with the update.

Also wondering if something like

guild export <RUN_ID> --dvc

Or similar would be helpful for people who want to keep their guild runs checked into their repo.