0.8.0.pre1 offers an early release of Guild’s support for DvC. To install this version, run
pip install guildai --pre.
Guild’s DvC integration supports two scenarios:
- Run DvC stages as Guild operations
- Define dependencies on DvC managed files using either the dvcfile or dvcstage source type
With this new DvC support, Guild can run stages defined in
dvc.yaml like this:
guild run dvc.yaml:<stage>
For example, consider a
dvc.yaml file that defines a stage for training a model.
stages:
  train-model:
    deps:
      - data.csv
    params:
      - lr
      - epochs
    cmd: python train.py
    outs:
      - model.joblib
When you run this stage with Guild, Guild performs a number of steps to generate a Guild run:
- Run results are stored in a newly created run directory, like any other Guild run
- DvC configuration is copied to the run directory and used to initialize a new, run-specific DvC repository
- Any parameter files are written to the run directory with run-specific flag values
- File dependencies are resolved by copying them from the project directory or by calling dvc pull in the run directory
- Metrics are logged as scalars
These steps ensure that the run is isolated from any changes to the project directory. Because Guild generates custom parameter source files based on run flag values, you can run your DvC stages in batches, including hyperparameter optimization.
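The parameter-file step above can be pictured with a minimal sketch. It is not Guild's implementation: the function name `write_run_params` is hypothetical, and the sketch assumes a JSON parameter file named `params.json` only to stay free of third-party dependencies (DvC's default parameter file is `params.yaml`). The idea is that the project's parameter file is read, flag values override the defaults, and the result is written to the run directory, leaving the project copy untouched.

```python
import json
from pathlib import Path


def write_run_params(project_dir, run_dir, flags, params_file="params.json"):
    """Write a run-specific params file (hypothetical helper, not Guild's API).

    Reads the project's params file if it exists, overrides its values with
    the run's flag values, and writes the result under run_dir.
    """
    source = Path(project_dir) / params_file
    params = json.loads(source.read_text()) if source.exists() else {}

    # Flag values take precedence over the project defaults.
    params.update(flags)

    target = Path(run_dir) / params_file
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(params, indent=2, sort_keys=True))
    return params
```

Calling this once per trial with different `flags` yields one isolated parameter file per run directory, which is what lets a batch of runs proceed without modifying files in the project.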
Guild also detects DvC stage dependencies and ensures that the correct output files are resolved for a run.
This approach is different from running DvC normally. DvC runs operations in the project directory, modifying files as needed to ensure correct state. Guild’s approach ensures that your project state remains unchanged and that your runs accurately reflect the project state at the time they’re started.
DvC runs are identical to other Guild runs, letting you use the full suite of Guild tools to manage and study them.
DvC is used to manage data sets and other non-source code files in a project. For more information, see DvC Get Started: Data Versioning.
If a Guild operation requires a file managed by DvC, you can list the file as a dependency using the
dvcfile source type, referring to the DvC file name as follows:
train:
  requires:
    - dvcfile: data.zip
When resolving the dependency
data.zip in this example, Guild first looks in the project for the file and uses it if it’s available. If the file is not in the project, Guild calls
dvc pull in the run directory to fetch the file.
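The lookup order described here (project file first, `dvc pull` as a fallback) can be sketched as follows. This is an illustration under stated assumptions, not Guild's code: `resolve_dvcfile` and `dvc_pull` are hypothetical names, and copying the project file into the run directory is an assumption about how the file is made available to the run.

```python
import shutil
import subprocess
from pathlib import Path


def dvc_pull(target, run_dir):
    """Fetch target by running `dvc pull <target>` inside run_dir."""
    subprocess.run(["dvc", "pull", target], cwd=run_dir, check=True)


def resolve_dvcfile(name, project_dir, run_dir, pull=dvc_pull):
    """Resolve a DvC-managed file for a run (hypothetical helper).

    Returns "copy" if the file was found in the project, "pull" otherwise.
    """
    source = Path(project_dir) / name
    dest = Path(run_dir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)

    if source.exists():
        # The file is already in the project: use it directly.
        shutil.copy2(source, dest)
        return "copy"

    # Otherwise fall back to DvC, run from the run directory.
    pull(name, run_dir)
    return "pull"
```

The `pull` argument is injectable so the fallback can be exercised without a DvC installation or a configured remote.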
To fetch the file using DvC, Guild needs the following project configuration:
- A *.dvc file for the dependency (e.g. data.zip.dvc in the example above)
- A default remote configured for the project in either
Guild uses the default remote configured in the DvC config (either
dvc.config.in) by default. Use the source
remote attribute to specify an explicit remote for the dependency. For example:
train:
  requires:
    - dvcfile: data.zip
      remote: my-remote-in-s3
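As a rough illustration of what the remote attribute changes, the sketch below builds the argument list for a `dvc pull` call with and without an explicit remote. The function name `dvc_pull_args` is hypothetical, and whether Guild invokes DvC with exactly these arguments is an assumption; `--remote` is the standard `dvc pull` option for selecting a remote.

```python
def dvc_pull_args(target, remote=None):
    """Build a `dvc pull` command line for target (hypothetical helper).

    Without `remote`, DvC falls back to the project's default remote.
    """
    args = ["dvc", "pull"]
    if remote:
        # Explicit remote, e.g. the value of the source `remote` attribute.
        args += ["--remote", remote]
    args.append(target)
    return args
```

For the example above, `dvc_pull_args("data.zip", remote="my-remote-in-s3")` produces a command that names the remote explicitly instead of relying on the default.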
You can define a dependency on DvC stage output for a Guild operation using the
dvcstage source type. For example, if a file
model.joblib is generated by a
train-model DvC stage in a project, you can define a dependency for the model file this way:
test-model:
  requires:
    - dvcstage: train-model
      select: model.joblib
This is similar to the
operation source type but specifies a DvC stage instead.
In this case, if
dvc.yaml defines a
train-model stage that outputs
model.joblib, Guild looks for a run with that stage name and links to the applicable file. This follows the convention used by Guild operation dependencies.
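The stage-output lookup can be sketched like this. It is a simplified model rather than Guild's implementation: the function `resolve_dvcstage`, the run records (dicts with `op`, `started`, and `dir` keys), and the choice of the most recently started run are all assumptions made for illustration.

```python
import os
from pathlib import Path


def resolve_dvcstage(stage, select, runs, run_dir):
    """Link `select` from a run of `stage` into run_dir (hypothetical helper).

    `runs` is a list of dicts with "op", "started", and "dir" keys.
    """
    candidates = [r for r in runs if r["op"] == stage]
    if not candidates:
        raise LookupError(f"no run found for DvC stage {stage!r}")

    # Assumption: prefer the most recently started run of the stage.
    latest = max(candidates, key=lambda r: r["started"])

    source = Path(latest["dir"]) / select
    if not source.exists():
        raise FileNotFoundError(f"{select} not found in {latest['dir']}")

    link = Path(run_dir) / select
    link.parent.mkdir(parents=True, exist_ok=True)
    # Link rather than copy, so the upstream run's output is referenced.
    os.symlink(source.resolve(), link)
    return link
```

Linking keeps the downstream run small and makes it clear which upstream run produced the file it used.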
For a detailed example of Guild’s DvC integration, refer to DvC Example in the Guild AI source repository.