Canonical Guild file data structure

Summary

This proposal seeks to formalize Guild’s project configuration (a Guild file) as a data structure.

This proposal is awaiting feedback.

Problem

Guild supports a flexible project configuration structure. However, the structure is not formally defined. It is implemented as a set of transformation behaviors defined in the Python module guild.guildfile and in supporting modules, including an externally configurable plugin system.

The lack of a formal structure presents the following problems:

  • It’s impractical to validate the correctness of project configuration
  • There is no point where a final, canonical data structure exists (to be viewed, saved, etc.)
  • It’s impractical to save complete project configuration in a run (Guild currently saves subsets of the configuration to run attributes for specific uses)
  • It’s impractical to provide IDE and other tool support (e.g. code completion) for editing project configuration
  • It’s impractical to apply generalized algorithms to merge or otherwise combine multiple configurations (a problem presented by the original R plugin, which does not support Guild file configuration)

Proposed Approach

Guild will provide an external schema for project configuration. The schema should be defined outside of imperative Python code (e.g. in JSON). It will be versioned and will correspond to Guild’s feature set at the time of publishing.

Guild will provide a command to check/validate project configuration.
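
As a rough illustration of what an external, versioned schema plus a check command could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the schema keys, the version string, and the `check` helper are hypothetical, and the tiny checker only exists to keep the example self-contained. A real implementation would hand the schema to an existing validator rather than a hand-written one.

```python
# Hypothetical schema fragment. In practice this would live in a JSON file
# outside the Python code base and be versioned alongside Guild releases.
SCHEMA = {
    "version": "0.1",  # hypothetical schema version
    "type": "object",
    "required": ["operations"],
    "properties": {
        "operations": {"type": "object"},
    },
}

# Only the handful of types this sketch needs.
_TYPES = {"object": dict, "array": list, "string": str}


def check(data, schema, path="$"):
    """Return a list of error strings; an empty list means the data is valid."""
    expected = schema.get("type")
    if expected and not isinstance(data, _TYPES[expected]):
        return [f"{path}: expected {expected}, got {type(data).__name__}"]
    errors = []
    if isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                errors.append(f"{path}: missing required key '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in data:
                errors.extend(check(data[key], subschema, f"{path}.{key}"))
    return errors


print(check({"operations": []}, SCHEMA))
```

Returning a list of errors rather than raising on the first problem lets a check command report everything wrong with a configuration in one pass.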

Guild will formalize the process of generating a fully resolved project configuration data structure. The stages are: reading configuration from disk, coercing simplified configuration to its canonical form, and applying plugin transformations. Guild will make the output of each stage visible and open to validation, which will be used for internal testing, external tool validation, and user validation of project configuration.
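
The staged resolution described above might be sketched as follows. This is not Guild's implementation: the function names, the snapshot list, and the example plugin are all hypothetical, and JSON stands in for the on-disk format to keep the sketch stdlib-only. The two coercions shown (a string operation becoming `{"main": ...}` and a bare flag value becoming `{"default": ...}`) are assumed examples of simplified forms.

```python
import copy
import json


def read_config(text):
    """Stage 1: parse raw configuration as written by the user."""
    return json.loads(text)


def coerce(data):
    """Stage 2: expand simplified forms into the canonical structure."""
    ops = {}
    for name, opdef in data.get("operations", {}).items():
        if isinstance(opdef, str):
            opdef = {"main": opdef}
        flags = {
            flag: val if isinstance(val, dict) else {"default": val}
            for flag, val in opdef.get("flags", {}).items()
        }
        ops[name] = {**opdef, "flags": flags}
    return {**data, "operations": ops}


def resolve(text, plugins=()):
    """Run every stage, keeping a deep-copied snapshot after each one."""
    snapshots = []
    data = read_config(text)
    snapshots.append(("read", copy.deepcopy(data)))
    data = coerce(data)
    snapshots.append(("coerce", copy.deepcopy(data)))
    for plugin in plugins:  # Stage 3: plugin transformations, in order
        data = plugin(data)
        snapshots.append((f"plugin:{plugin.__name__}", copy.deepcopy(data)))
    return data, snapshots


def add_default_description(data):
    """Hypothetical plugin: make sure every operation has a description."""
    for opdef in data["operations"].values():
        opdef.setdefault("description", "")
    return data


raw = '{"operations": {"train": "train_module", "eval": {"main": "eval_module", "flags": {"lr": 0.1}}}}'
resolved, snapshots = resolve(raw, [add_default_description])
for stage, snapshot in snapshots:
    print(stage, json.dumps(snapshot, sort_keys=True))
```

Keeping a snapshot per stage is one way to provide the visibility mentioned above: tests and plugin authors can compare the data before and after any single step.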

To Do

- Identify a schema system to use

Consider how well the system is supported outside Python and how useful it is to other tools. It should be possible to generate a disk-based artifact and validate it using pre-existing tools (not home-grown ones). The system must support tracking original configuration files and line numbers per configuration value.

If such a system doesn’t exist (it might not!) we could consider writing our own, though this is something to avoid if possible.
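
Whatever schema system is chosen, the source-tracking requirement could be met with a side table that sits next to the plain data. The sketch below assumes values are addressed by JSON Pointer style paths; the `Source` class, the `lookup` and `describe` helpers, and the file name and line numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Where a configuration value was originally defined."""
    filename: str
    line: int


def lookup(data, pointer):
    """Resolve a JSON Pointer style path (no escape handling in this sketch)."""
    node = data
    for part in (p for p in pointer.split("/") if p):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# The configuration stays a plain dict; origins live in a separate mapping.
config = {"operations": {"train": {"main": "train_module"}}}
sources = {
    "/operations/train": Source("guild.yml", 1),       # hypothetical positions
    "/operations/train/main": Source("guild.yml", 2),
}


def describe(pointer):
    """Format a value together with its origin, e.g. for an error message."""
    src = sources[pointer]
    return f"{src.filename}:{src.line}: {pointer} = {lookup(config, pointer)!r}"


print(describe("/operations/train/main"))
```

Because the origins are stored outside the data, the configuration itself can still be written to disk and validated by generic tools that know nothing about line numbers.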

Alternative Approaches

Do nothing

The “do nothing” option will slow the development of the following features:

  • IDE integration (code completion)
  • Extending R’s script-based configuration with Guild file-based configuration
  • Improved development and debugging support for plugins

Python based typing

Guild could formalize project configuration using Python data structures and Python’s type system. However, as Guild moves toward being a language-neutral tool, it ought to rely less on Python-specific conventions and use cross-language facilities where possible.

My first thought when reading this was that a formal schema is too heavy / not worth it, and that some additional docs, front-loading most of the guildfile canonicalization work, and maybe a minor incremental step up in structure would be enough. But I got curious to see what quarto did here (they have a similar problem). Seems like they went with a schema. So now I’m thinking maybe that’s the way to go.

For autocompletion, quarto compiles quarto-cli/yaml-intelligence-resources.json (at 87ae6fc29f20b4954699be9b75ae90984e03e8de in quarto-dev/quarto-cli on GitHub) from the directory of yml files at quarto-cli/src/resources/schema (at 4f241dbede25c03e06e36b84a2cb9c0d02501f5e in the same repository).

Investing in a robust schema solution here is probably worth it, for these reasons:

  • This is the best pathway to robust IDE autocompletion.

  • Merging configs that live in different places is hard; a schema will help.

  • Centralized config validation makes it possible to generate very informative and useful error messages.

  • The sure-footedness we’ll get for any task that involves inferring user intent will speed up future development.
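
To make the merging point concrete: once every config is in the same canonical shape, a single generic merge can replace per-feature special cases. The sketch below is an assumption about how that could look, not existing Guild behavior; the `merge` function and the "override wins" rule are hypothetical choices.

```python
def merge(base, override):
    """Recursively merge two canonical configs; override wins on conflicts."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, val in override.items():
            merged[key] = merge(base[key], val) if key in base else val
        return merged
    # Non-dict values (scalars, lists) are replaced wholesale.
    return override


# Hypothetical example: script-derived config refined by a Guild file.
from_script = {
    "operations": {"train": {"main": "train", "flags": {"lr": {"default": 0.1}}}}
}
from_guildfile = {
    "operations": {
        "train": {"flags": {"lr": {"default": 0.01}, "epochs": {"default": 5}}}
    }
}

merged = merge(from_script, from_guildfile)
print(merged)
```

This only works reliably on canonical data: if one side still used a shorthand such as a bare flag value, the dict-versus-scalar comparison would replace the whole entry instead of merging it.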

One nice feature right now is that some of the canonicalization work is lazy, enabling the interface to be fast. Figuring out how to keep things fast while also fully validating the config on every command seems hard. It might force us into some kind of caching.
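
If caching does turn out to be necessary, one possible shape is to key the resolved configuration on a hash of the source text, so full resolution and validation run only when the file changes. This is a sketch under assumptions: `expensive_resolve` is a hypothetical stand-in for the real pipeline, the cache is in-memory only, and a real version would also need to account for included files and plugin versions and persist across command invocations.

```python
import hashlib
import json

_cache = {}
calls = {"count": 0}  # counts real resolutions, for demonstration only


def expensive_resolve(text):
    """Stand-in for full canonicalization plus validation."""
    calls["count"] += 1
    return json.loads(text)


def resolve_cached(text):
    """Resolve only when this exact source text has not been seen before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_resolve(text)
    return _cache[key]


source = '{"operations": {"train": {"main": "train_module"}}}'
resolve_cached(source)
resolve_cached(source)  # served from the cache, no second resolution
print(calls["count"])
```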

Worth noting, I’m not opposed to Pydantic. It can be nice too. If we do go down that path, I would want to expand the support to enable (opt-in) run-time checks, and to make sure the type annotations are exercised/validated during tests. Having incorrect type annotations is 100x worse than no type annotations. (This is not a theoretical concern; I’ve already encountered incorrect type annotations.)

I would say, though, that Pydantic and a schema are mostly orthogonal questions / solutions, solving different problems. We can do both, either, or neither. Pydantic isn’t tailored for the use case of validating values (or presenting those value choices to users), and that’s what is most useful for people working on a guild.yml file.

Current (very sway-able) vote: schema.

Notes from discussion with @alxmke

  • Only ever work with Python data structures: don’t use classes in guild.guildfile for anything other than a read-only facade into the dict/list structures
  • Formalize the “resolution” phases (e.g. config source includes, inheritance, attr value includes, value coercion, etc.) up front (this will be automatic when the first point is solved)
  • Support for logging/viewing various phases (useful for debugging/testing, esp for plugin development)
  • Save resolved Guild file per run (could perhaps only save the opdef). This is our official interface to the operation, which can be used for full-featured restarts, reports, etc.
  • Use run Guild file for restart support
  • Add --force-guildfile (or something along that line — what we really mean is “force project config”) to run command — analogous to --force-sourcecode but for Guild file changes
  • Drop weird encodings for Guild file related attrs in run (whatever these may be)
  • Plugins mutate the data, either directly or by way of helper functions
  • Provide assurances of data structure into Plugin callbacks and checks for structure out of callbacks
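
The last two notes, plugins mutating plain data with structure checks on the way in and out, could be sketched as follows. The `check_structure` and `apply_plugin` helpers and both example plugins are hypothetical, and the structural rules checked here are a deliberately small assumed subset.

```python
import copy


def check_structure(data):
    """Raise if the data no longer has the expected canonical shape."""
    if not isinstance(data, dict):
        raise TypeError("config must be a dict")
    ops = data.get("operations")
    if not isinstance(ops, dict):
        raise ValueError("'operations' must be a dict")
    for name, opdef in ops.items():
        if not isinstance(opdef, dict):
            raise ValueError(f"operation '{name}' must be a dict")


def apply_plugin(plugin, data):
    """Call a plugin on a copy of the data, checking structure in and out."""
    check_structure(data)          # assurance given to the callback
    result = copy.deepcopy(data)   # the caller's data is never mutated
    plugin(result)                 # the plugin mutates the copy in place
    check_structure(result)        # check on what the callback produced
    return result


def set_lr_default(data):
    """Hypothetical well-behaved plugin."""
    flags = data["operations"]["train"].setdefault("flags", {})
    flags["lr"] = {"default": 0.01}


def break_opdef(data):
    """Hypothetical misbehaving plugin: replaces an opdef with a string."""
    data["operations"]["train"] = "oops"


config = {"operations": {"train": {"main": "train_module"}}}
updated = apply_plugin(set_lr_default, config)

try:
    apply_plugin(break_opdef, config)
    rejected = False
except ValueError:
    rejected = True
print(updated, rejected)
```

Running the plugin against a copy means a failed check leaves the previous phase's data intact, which also fits the idea of logging or viewing each phase.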