This proposal outlines “project aware” Guild home default locations.
A project is a directory that Guild infers as a top-level container for user code. Projects generally correspond to Git repositories, though not necessarily.
Users work with projects to maintain source code for various applications — e.g. deployable software, model development, data management, etc. By convention, Guild runs are commonly tied to projects.
Guild should store project related runs within the project directory by default, rather than under the user’s home directory (i.e.
[This is a breaking change.]
This proposal is under development.
Guild runs are commonly generated for project work. For example, a team of engineers and data scientists may develop a model, which requires source code, model configuration, prepared data, and tooling to generate trained models and compare outcomes. This work is performed in the context of a project — i.e. a top-level directory that contains the source code used for the work.
Guild stores runs under Guild Home, which is `~/.guild` by default. To accommodate project workflows in Python, Guild changes the default for Guild Home when a virtual environment is activated. For example, if a virtual environment is located in `~/my-project/.venv`, the default Guild Home is `~/my-project/.venv/.guild`
when that environment is activated. This is the same for both traditional Python virtual environments (those created using `virtualenv` or the standard Python `venv` module) and those created using `conda`.
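The current default described above can be illustrated with a minimal sketch. It is not Guild's actual implementation: the function name is hypothetical, and it assumes that activation is detected through the `VIRTUAL_ENV` variable (exported by `virtualenv`/`venv` activation scripts) and the `CONDA_PREFIX` variable (exported by `conda activate`), checked in that order.

```python
import os
from pathlib import Path


def current_default_guild_home(environ=None):
    """Sketch of the existing default for Guild Home (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # An explicit GUILD_HOME setting takes precedence over any default.
    explicit = env.get("GUILD_HOME")
    if explicit:
        return Path(explicit).expanduser()
    # Assumption: an activated environment is detected via these variables.
    for var in ("VIRTUAL_ENV", "CONDA_PREFIX"):
        prefix = env.get(var)
        if prefix:
            return Path(prefix).expanduser() / ".guild"
    # No activated environment: fall back to the user-level location.
    return Path.home() / ".guild"
```

Passing a plain dictionary instead of the real environment makes the behavior easy to check, e.g. `current_default_guild_home({"VIRTUAL_ENV": "/tmp/my-project/.venv"})`.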
The spirit of this modified behavior — using the location of activated virtual environments to store runs — is motivated by a desire to consolidate runs for a particular project, as separate virtual environments are commonly used per project in Python workflows.
This scheme presents a few issues:
- Virtual environments are not strictly aligned with project work. A user may opt not to use a virtual environment when running project code, or might use multiple virtual environments. Guild would store runs in a different location in each case (one for no activated environment and one for each environment used). This forces users to set `GUILD_HOME` as a workaround to consolidate runs in a single project location.
- Virtual environments are VM constructs and do not typically store ‘var’ style data, i.e. data that is routinely updated. A user or engineer who’s unaware of Guild’s default behavior might unintentionally delete runs when cleaning an environment.
- Languages that don’t use virtual environments (e.g. R) can’t signal to Guild that runs should be located in a project.
This proposal entails two changes:
- A file layout heuristic for inferring a user project
- A heuristic for inferring Guild home using the current directory and file layout, including the existence of a project
Guild will infer that a directory is a project if it contains a file designating it as such. Project-designating files must match at least one pre-designated project file pattern. A pattern consists of a glob pattern and a file type designation. Glob patterns are applied to the file's path relative to the project directory. A file type designation is one of: `file`, `dir`, or `any`.
Below is a working list of such patterns:
| Pattern | Type | Description |
|---|---|---|
| | dir | Explicit indicator of a Guild home location (repository) |
| | file | R session init file |
| | file | R session activation file (TODO: is this needed given |
| | file | PEP 518 build dependency file |
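A rough sketch of the project heuristic follows, assuming patterns are represented as `(glob, type)` pairs. The function names and the two example patterns are hypothetical placeholders for illustration. They are not the proposal's actual working list.

```python
from pathlib import Path

# Hypothetical (glob pattern, file type) pairs, for illustration only.
EXAMPLE_PATTERNS = [
    (".guild", "dir"),
    ("pyproject.toml", "file"),
]


def _type_matches(path, file_type):
    """Check a path against a file type designation: file, dir, or any."""
    if file_type == "file":
        return path.is_file()
    if file_type == "dir":
        return path.is_dir()
    if file_type == "any":
        return path.exists()
    raise ValueError(f"unknown file type designation: {file_type!r}")


def is_project(directory, patterns=EXAMPLE_PATTERNS):
    """Return True if `directory` contains a file matching any pattern."""
    root = Path(directory)
    for glob_pattern, file_type in patterns:
        # Path.glob evaluates the pattern relative to `root`.
        for candidate in root.glob(glob_pattern):
            if _type_matches(candidate, file_type):
                return True
    return False
```

Keeping the type check separate from the glob means a regular file named `.guild` would not satisfy a `dir` pattern, which is the distinction the file type designation is meant to capture.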
Consider a simple directory with the file `test.py`. When the user runs `guild run test.py` from that directory, Guild will look for Guild home in the parent directories, up to and including the home directory or root. If it fails to find a Guild home, it will use `~/.guild`. Let’s say that’s the case and the first run lands in
What happens when the user adds `pyproject.toml` to the directory? I can imagine one of two things. Option 1: Guild sees
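The parent-directory lookup in this scenario might look roughly like the sketch below. The function name and parameters are hypothetical. It also assumes that a Guild home is recognized as a `.guild` directory and that the starting directory itself is checked first. The proposal does not confirm either detail.

```python
from pathlib import Path


def find_guild_home(start, home=None, marker=".guild"):
    """Walk from `start` upward looking for a `marker` directory.

    Stops after checking the user's home directory or the filesystem
    root, whichever comes first, then falls back to `<home>/.guild`.
    """
    home = Path.home() if home is None else Path(home)
    home = home.resolve()
    current = Path(start).resolve()
    while True:
        candidate = current / marker
        if candidate.is_dir():
            return candidate
        # Stop once the home directory or the root has been checked.
        if current == home or current.parent == current:
            break
        current = current.parent
    return home / ".guild"
```

The walk only needs a handful of filesystem checks per directory level. How a project file such as `pyproject.toml` should alter the result is the open question raised in this scenario, so the sketch leaves it out.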
- Is a Git repo (e.g. directory `.git/objects`) not sufficient to identify 90% of all projects? The issue of course is that projects are often started without initializing the repo.
As Guild home is used for nearly every command, its resolution must be efficient. E.g. the implementation cannot rely on Python entry points (e.g. to delegate project file inference to plugins) as the underlying support for this feature is remarkably inefficient (e.g. 200ms just to import
TODO - note option of `guild init` and any required change to expose this to the user