Using Guild AI with DvC

Overview

Guild AI 0.8.0.pre1 is an early release of Guild’s support for DvC. To install this version, run pip install guildai --pre.

Guild’s DvC integration supports two scenarios:

  • Run DvC stages as Guild operations
  • Define dependencies on DvC-managed files using either dvcfile or dvcstage sources

Run DvC Stages

With this new DvC support, Guild can run stages defined in dvc.yaml like this:

guild run dvc.yaml:<stage>

For example, consider a dvc.yaml file that defines a stage for training a model.

stages:
  train-model:
    deps:
      - data.csv
    params:
      - lr
      - epochs
    cmd: python train.py
    outs:
      - model.joblib

When you run this stage with Guild, Guild performs a number of steps to generate a Guild run:

  • Run results are stored in a newly created run directory, like any other Guild run
  • DvC configuration is copied to the run directory and used to initialize a new, run-specific DvC repository
  • Any parameter files are written to the run directory with run-specific flag values
  • File dependencies are resolved by copying them from the project directory or by calling dvc pull in the run directory
  • Metrics are logged as scalars

These steps ensure that the run is isolated from any changes to the project directory. Because Guild generates custom parameter source files based on run flag values, you can run your DvC stages in batches, including hyperparameter optimization.
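The parameter-file step above can be pictured with a short Python sketch. This is not Guild’s implementation: the function name write_run_params, the JSON parameter file, and the sample values are assumptions chosen so the example runs with only the standard library. It shows one way flag values could be merged into a run-specific copy of the parameters while the project copy stays untouched.

```python
import json
import tempfile
from pathlib import Path


def write_run_params(project_params: dict, flags: dict, run_dir: Path) -> Path:
    """Write a run-specific params file; the project defaults are not modified."""
    # Flag values override the project defaults for this run only.
    run_params = {**project_params, **flags}
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "params.json"  # hypothetical file name
    target.write_text(json.dumps(run_params, indent=2))
    return target


# Hypothetical project defaults and a small batch over lr.
project_params = {"lr": 0.1, "epochs": 10}
base = Path(tempfile.mkdtemp())
written = [
    write_run_params(project_params, {"lr": lr}, base / f"run-{i}")
    for i, lr in enumerate([0.001, 0.01])
]
```

Each run directory ends up with its own parameter file, which is what makes batches and hyperparameter searches possible without the runs interfering with one another.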

Guild also detects dependencies between DvC stages and ensures that the correct output files are resolved for a run.

This approach is different from running DvC normally. DvC runs operations in the project directory, modifying files as needed to ensure correct state. Guild’s approach ensures that your project state remains unchanged and that your runs accurately reflect the project state at the time they’re started.

DvC runs are identical to other Guild runs, letting you use the full suite of Guild tools to manage and study them.

DvC Managed Files

DvC is used to manage data sets and other non-source code files in a project. For more information, see DvC Get Started: Data Versioning.

If a Guild operation requires a file managed by DvC, you can list the file as a dependency using the dvcfile source type, referring to the DvC file name as follows:

train:
  requires:
    - dvcfile: data.zip

When resolving the dependency data.zip in this example, Guild first looks in the project for the file and uses it if it’s available. If the file is not in the project, Guild calls dvc pull in the run directory to fetch the file.

To fetch the file using DvC, Guild needs the following project configuration:

  • A *.dvc file for the dependency (e.g. data.zip.dvc in the example above)
  • A default remote configured for the project in either .dvc/config or dvc.config.in

Unless told otherwise, Guild uses the default remote from the DvC config (either .dvc/config or dvc.config.in). Use the source’s remote attribute to specify an explicit remote for the dependency. For example:

train:
  requires:
    - dvcfile: data.zip
      remote: my-remote-in-s3
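The resolution order described in this section (use the project file if present, otherwise call dvc pull, optionally against an explicit remote) might be sketched as follows. The helper names resolve_dvcfile and build_pull_command are hypothetical, and copying the project file is an assumption about how Guild "uses" it. The sketch also assumes the run directory already holds the DvC configuration and the *.dvc file that dvc pull needs.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_pull_command(name: str, remote: Optional[str] = None) -> List[str]:
    """Build a `dvc pull` command for one target, optionally with a remote."""
    cmd = ["dvc", "pull", name]
    if remote:
        cmd += ["--remote", remote]
    return cmd


def resolve_dvcfile(
    name: str, project_dir: Path, run_dir: Path, remote: Optional[str] = None
) -> str:
    """Prefer the project copy; fall back to `dvc pull` in the run directory."""
    source = project_dir / name
    if source.exists():
        shutil.copy2(source, run_dir / name)  # assumption: the file is copied
        return "copied"
    subprocess.run(build_pull_command(name, remote), cwd=run_dir, check=True)
    return "pulled"


# Demo of the first branch only: the file is available in the project.
project_dir = Path(tempfile.mkdtemp())
run_dir = Path(tempfile.mkdtemp())
(project_dir / "data.zip").write_bytes(b"example")
result = resolve_dvcfile("data.zip", project_dir, run_dir)
```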

Guild also supports dependencies on files generated by DvC stages. For more information, see below.

DvC Stage Output Dependencies

You can define a dependency on DvC stage output for a Guild operation using the dvcstage source type. For example, if a file model.joblib is generated by a train-model DvC stage in a project, you can define a dependency for the model file this way:

test-model:
  requires:
    - dvcstage: train-model
      select: model.joblib

This is similar to an operation source type but instead specifies a DvC stage.

In this case, if dvc.yaml defines a train-model stage that outputs model.joblib, Guild looks for a run with that name and links to the applicable file. This follows the convention used by Guild operation dependencies.
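As a rough illustration of that lookup, the Python sketch below selects a run by stage name and links the selected file into the dependent run’s directory. It is not Guild’s code: the run records, the function names, the choice of the most recently started run, and the use of a symbolic link are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional


def find_stage_run(runs: List[dict], stage: str) -> Optional[dict]:
    """Return the most recently started run for the stage, or None."""
    matches = [r for r in runs if r["operation"] == stage]
    return max(matches, key=lambda r: r["started"], default=None)


def link_stage_output(runs: List[dict], stage: str, select: str, run_dir: Path) -> Path:
    """Link the selected output of the matching stage run into run_dir."""
    run = find_stage_run(runs, stage)
    if run is None:
        raise LookupError(f"no run found for stage {stage!r}")
    link = run_dir / select
    os.symlink(Path(run["dir"]) / select, link)
    return link


# Hypothetical run records: two train-model runs with different outputs.
base = Path(tempfile.mkdtemp())
runs = []
for started, content in [(1, "old model"), (2, "new model")]:
    d = base / f"train-{started}"
    d.mkdir()
    (d / "model.joblib").write_text(content)
    runs.append({"operation": "train-model", "dir": str(d), "started": started})

test_run_dir = base / "test-model-run"
test_run_dir.mkdir()
link = link_stage_output(runs, "train-model", "model.joblib", test_run_dir)
```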

Example

For a detailed example of Guild’s DvC integration, refer to DvC Example in the Guild AI source repository.
