Overview
Version 0.8.0.pre1 offers an early release of Guild’s support for DvC. To install this version, run pip install guildai --pre.
Guild’s DvC integration supports two scenarios:
- Run DvC stages as Guild operations
- Define dependencies on DvC-managed files using either dvcfile or dvcstage sources
Run DvC Stages
With this new DvC support, Guild can run stages defined in dvc.yaml like this:
guild run dvc.yaml:<stage>
For example, consider a dvc.yaml file that defines a stage for training a model.
stages:
  train-model:
    deps:
      - data.csv
    params:
      - lr
      - epochs
    cmd: python train.py
    outs:
      - model.joblib
When you run this stage with Guild, Guild performs a number of steps to generate a Guild run:
- Run results are stored in a newly created run directory, like any other Guild run
- DvC configuration is copied to the run directory and used to initialize a new, run-specific DvC repository
- Any parameter files are written to the run directory with run-specific flag values
- File dependencies are resolved by copying them from the project directory or by calling dvc pull in the run directory
- Metrics are logged as scalars
These steps ensure that the run is isolated from any changes to the project directory. Because Guild generates custom parameter source files based on run flag values, you can run your DvC stages in batches, including hyperparameter optimization.
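The batch behavior described above can be illustrated with a minimal sketch. It is not Guild’s implementation: the helper names (expand_flags, write_params, generate_trials), the trial-N directory names, and the flat params.yaml layout are assumptions made for illustration. List-valued flags are expanded into one trial per combination, and each trial gets its own parameter file.

```python
import itertools
from pathlib import Path


def expand_flags(flags):
    """Yield one flag dict per combination of list-valued flags (grid expansion)."""
    names = list(flags)
    # Wrap scalars so every flag contributes a list of candidate values.
    choices = [v if isinstance(v, list) else [v] for v in flags.values()]
    for combo in itertools.product(*choices):
        yield dict(zip(names, combo))


def write_params(run_dir, flag_values):
    """Write a flat params.yaml holding this trial's flag values.

    Assumes top-level scalar parameters only (hypothetical simplification).
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"{name}: {value}" for name, value in flag_values.items()]
    path = run_dir / "params.yaml"
    path.write_text("\n".join(lines) + "\n")
    return path


def generate_trials(root, flags):
    """Create one directory per trial under root, each with its own params file."""
    written = []
    for i, trial in enumerate(expand_flags(flags), start=1):
        # The trial-N naming is illustrative; Guild uses its own run IDs.
        written.append(write_params(Path(root) / f"trial-{i}", trial))
    return written
```

Calling generate_trials(root, {"lr": [0.01, 0.1], "epochs": 10}) writes two parameter files, one per learning rate. Because every trial reads its own copy, the project’s parameter file is never touched. With Guild itself, a comparable batch would presumably be started by passing a list-valued flag such as lr=[0.01,0.1] to guild run, assuming the stage parameters are exposed as flags under their own names.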
Guild also detects dependencies between DvC stages and ensures that the correct output files are resolved for a run.
This approach is different from running DvC normally. DvC runs operations in the project directory, modifying files as needed to ensure correct state. Guild’s approach ensures that your project state remains unchanged and that your runs accurately reflect the project state at the time they’re started.
Runs generated from DvC stages are ordinary Guild runs, letting you use the full suite of Guild tools to manage and study them.
DvC Managed Files
DvC is used to manage data sets and other non-source code files in a project. For more information, see DvC Get Started: Data Versioning.
If a Guild operation requires a file managed by DvC, you can list the file as a dependency using the dvcfile source type, referring to the DvC file name as follows:
train:
  requires:
    - dvcfile: data.zip
When resolving the dependency data.zip in this example, Guild first looks in the project for the file and uses it if it’s available. If the file is not in the project, Guild calls dvc pull in the run directory to fetch the file.
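The lookup order described above can be sketched as follows. This is a rough illustration rather than Guild’s actual code: the function names and the returned labels are hypothetical, and the pull step is injectable so the logic can be exercised without a DvC repository.

```python
import shutil
import subprocess
from pathlib import Path


def dvc_pull(target, run_dir):
    """Fetch a DvC-tracked file by running `dvc pull <target>` in the run directory."""
    subprocess.run(["dvc", "pull", target], cwd=run_dir, check=True)


def resolve_dvcfile(name, project_dir, run_dir, pull=dvc_pull):
    """Resolve `name` into run_dir; return "project" or "dvc-pull" (hypothetical labels)."""
    src = Path(project_dir) / name
    dest = Path(run_dir) / name
    if src.exists():
        # The file is already in the project: copy it into the run directory.
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        return "project"
    # Otherwise fall back to DvC, which needs the configuration listed below.
    pull(name, run_dir)
    return "dvc-pull"
```

Trying the project first avoids a network round trip when the file is already on disk, while the fallback lets a fresh checkout run without a manual dvc pull.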
To fetch the file using DvC, Guild needs the following project configuration:
- A *.dvc file for the dependency (e.g. data.zip.dvc in the example above)
- A default remote configured for the project in either .dvc/config or dvc.config.in
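A quick pre-flight check for these two requirements might look like the sketch below. It is not part of Guild: the function names are hypothetical, and it assumes that dvc.config.in uses the same INI-style layout as .dvc/config, with the default remote stored as remote under a [core] section.

```python
import configparser
from pathlib import Path


def default_remote(config_path):
    """Return the `remote` value from the [core] section, or None if absent."""
    # Interpolation is disabled so that '%' characters in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)
    return parser.get("core", "remote", fallback=None)


def check_dvc_pull_requirements(project_dir, dep_name):
    """Return a list of problems; an empty list means both requirements hold."""
    project = Path(project_dir)
    problems = []
    if not (project / f"{dep_name}.dvc").exists():
        problems.append(f"missing {dep_name}.dvc")
    # Assumption: both candidate files share the same INI-style format.
    configs = [project / ".dvc" / "config", project / "dvc.config.in"]
    if not any(p.exists() and default_remote(p) for p in configs):
        problems.append("no default remote in .dvc/config or dvc.config.in")
    return problems
```

Running check_dvc_pull_requirements(".", "data.zip") from the project root returns an empty list when the project is ready for dvc pull.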
Unless told otherwise, Guild uses the default remote configured in the DvC config (either .dvc/config or dvc.config.in). Use the source remote attribute to specify an explicit remote for the dependency. For example:
train:
  requires:
    - dvcfile: data.zip
      remote: my-remote-in-s3
Guild also supports dependencies on files generated by DvC stages. For more information, see below.
DvC Stage Output Dependencies
You can define a dependency on DvC stage output for a Guild operation using the dvcstage source type. For example, if a file model.joblib is generated by a train-model DvC stage in a project, you can define a dependency for the model file this way:
test-model:
  requires:
    - dvcstage: train-model
      select: model.joblib
This is similar to an operation source type but instead specifies a DvC stage.
In this case, if dvc.yaml defines a train-model stage that outputs model.joblib, Guild looks for a run with that name and links to the applicable file. This follows the convention used by Guild operation dependencies.
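The resolution step can be pictured with the following sketch. It is illustrative only: the run records (dicts with operation, started, and dir keys), the choice of the most recently started run, and the treatment of select as a regular expression matched against relative paths are all assumptions, not confirmed details of Guild’s behavior.

```python
import os
import re
from pathlib import Path


def latest_run(runs, stage):
    """Pick the most recently started run whose operation name equals the stage name."""
    matches = [r for r in runs if r["operation"] == stage]
    return max(matches, key=lambda r: r["started"], default=None)


def link_selected(upstream_dir, run_dir, select):
    """Symlink upstream files whose relative path matches the select pattern."""
    pattern = re.compile(select)
    upstream = Path(upstream_dir)
    linked = []
    for path in sorted(upstream.rglob("*")):
        rel = path.relative_to(upstream).as_posix()
        if path.is_file() and pattern.match(rel):
            dest = Path(run_dir) / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            # Link rather than copy, mirroring "links to the applicable file".
            os.symlink(path, dest)
            linked.append(rel)
    return linked
```

For the test-model example, the sketch would first call latest_run(runs, "train-model") and then link_selected(run["dir"], test_run_dir, "model.joblib"), leaving a link to the trained model in the new run directory.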
Example
For a detailed example of Guild’s DvC integration, refer to DvC Example in the Guild AI source repository.