IMPORTANT: We want to copy source code to the run root - feedback requested

Summary

This proposal seeks to address issues related to source code location for a run.

This is NOT a breaking change.

This proposal is awaiting feedback.

Problem

Guild currently copies project source code to .guild/sourcecode by default. This presents two problems:

  1. Users can be surprised that source code does not exist where they expect it to be.

  2. Some Guild language providers cannot reasonably run an operation when source code does not maintain the same directory structure relative to the current directory.

Guild separates source code from other run-related files (inputs and outputs). This is by design, with the intent of keeping the run directory “clean”. It was believed that there’s some value in starting with an empty directory and then requiring explicit configuration of that directory. In practice, this requirement is often onerous and forces Guild to implement various “smart” workarounds to help the user. The result encourages explicit configuration, but at the cost of surprising behavior and complexity.

There is no technical reason for this separation. In fact, the scheme presents technical challenges, as in the case of R language support.

Proposed Approach

We propose to remove the distinction between “source code location” and the run directory by changing the default location for source code to . (i.e. the run directory root). We would note in documentation that the sourcecode.dest operation attribute should be used to support legacy run formats.

We would modify run file list commands to include all files by default and provide options to filter by source code, inputs, and generated files.

Here are some examples of the modified ls command:

guild ls  # shows all files in run dir, which would include source code
guild ls --sourcecode  # shows only source code files
guild ls --deps  # shows only resolved dependencies
guild ls --generated  # shows only generated files
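
The filtering above could be sketched as follows. This is only an illustration, not Guild's implementation: it assumes Guild records at run init which paths were copied as source code and which were resolved as dependencies, and it treats everything else as generated. The names `FileKind`, `classify_run_files`, and `filter_run_files` are hypothetical.

```python
from enum import Enum


class FileKind(Enum):
    SOURCECODE = "sourcecode"
    DEP = "deps"
    GENERATED = "generated"


def classify_run_files(run_files, sourcecode_paths, dep_paths):
    """Map each run-relative path to a FileKind.

    Assumption: the source code and dependency path lists were recorded
    when the run was initialized. Any other file is treated as generated.
    """
    sourcecode = set(sourcecode_paths)
    deps = set(dep_paths)
    kinds = {}
    for path in run_files:
        if path in sourcecode:
            kinds[path] = FileKind.SOURCECODE
        elif path in deps:
            kinds[path] = FileKind.DEP
        else:
            kinds[path] = FileKind.GENERATED
    return kinds


def filter_run_files(run_files, sourcecode_paths, dep_paths, kind=None):
    """Return all files when kind is None, otherwise only files of that kind."""
    kinds = classify_run_files(run_files, sourcecode_paths, dep_paths)
    return [p for p in run_files if kind is None or kinds[p] is kind]
```

With no filter the full listing is returned, which mirrors the proposed default for guild ls. Passing a kind corresponds to one of the filter options.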

Guild View would similarly be changed to support filtering by source code, dependencies and generated files.

Backward compatibility

Guild would continue to maintain support for alternative source code destinations and provide thorough test coverage for these. New runs would default to the new scheme. Users who prefer the old scheme could specify .guild/sourcecode in their Guild file as needed.

The notion of “source code destination” would be considered a legacy topic but the distinction would be maintained in code.

Differentiating source code from other run dependencies

Guild would continue to support sourcecode filter rules as they exist today. These rules would be used to differentiate “source code” from other project local dependencies. Files that are not selected through these rules would not be copied to the run directory as part of the source code initialization stage. They would need to be specified as dependencies using the requires operation attribute, just as they are today.

See Project filtering below for a future enhancement not considered by this proposal.

Alternative Approaches

Language plugins configure source code destinations as needed

Guild supports a solution today for languages that prefer (or require) code structure to remain consistent between the project directory and the run directory. The R plugin, for example, would implicitly set the source code destination to the run directory root for R-based operations.

Other language plugins (e.g. the amusing Erlang plugin example) could follow this pattern as needed.

Python operations would otherwise work as they do today.

The advantage of this approach is that it avoids impacting current Guild users. We may under-appreciate the value users ascribe to separating source code from the run directory root.

The disadvantage is that runs behave differently depending on the operation language. We also lose the benefits to Python developers of having a straightforward run directory that closely mirrors the project structure. The bifurcation between source code structure and other run files has long been a source of confusion for some users and it’s not clear there’s a practical payoff.

Future enhancement not considered in this proposal

Project filtering

With this proposal, Guild treats a run as a full or limited copy of the project structure. This is a direct approach to initializing a run:

  • Copy project files to the run directory as they appear in the project directory
  • Resolve other dependencies, potentially replacing project files based on dependency configuration
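
The two steps above can be sketched roughly as follows. This is a simplified illustration and not Guild's code: the function name `init_run_dir` and its parameters are hypothetical, and dependencies are copied here for simplicity, whereas a real implementation might link them or fetch them from elsewhere.

```python
import shutil
from pathlib import Path


def init_run_dir(project_dir, run_dir, sourcecode_files, dependencies):
    """Initialize a run directory in two steps.

    sourcecode_files: paths relative to project_dir (hypothetical input).
    dependencies: dict mapping a run-relative destination to a source
        file path (hypothetical input).
    """
    project_dir = Path(project_dir)
    run_dir = Path(run_dir)

    # Step 1: copy project files, preserving their relative layout.
    for rel in sourcecode_files:
        dest = run_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(project_dir / rel, dest)

    # Step 2: resolve dependencies. A dependency that targets the same
    # path as a copied project file replaces that file.
    for rel, src in dependencies.items():
        dest = run_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
```

Because step 2 runs after step 1, a resolved dependency takes precedence over a project file at the same path, which matches the "potentially replacing project files" behavior described above.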

This approach impacts two topics: source code specs and project local dependencies.

Guild currently provides a robust but complicated scheme for specifying “source code” to be copied to the run directory. This scheme ranks among the most difficult topics for Guild users.

Guild currently requires non-source files (i.e. files that aren’t already copied using Guild’s implied or arcane source code filtering rules) to be explicitly listed as dependencies using the requires operation attribute. This is easily the top-ranked point of confusion for Guild users.

This proposal addresses both of these topics and can additionally simplify configuration by treating run init as a “what to ignore from the project” problem. This is similar to Git’s “what to ignore” scheme, implemented by .gitignore.

While this feature is decoupled from this proposal (which is limited to changing the default source code location and updating documentation), we should consider additional features:

  • Adopt a “what to ignore” scheme, which can be used to prune a run init by ignoring specified files and directories
  • Consider .gitignore as a default heuristic for “what to ignore” regarding source code
  • Consider dvc.yaml or other mainstream data management specs as a default heuristic for dependencies (i.e. non-source code files required by an operation)
  • Add a new dependency type dir as a semantic improvement over file in cases where a project local dependency is a directory

Ignoring source code

Guild would use .gitignore when generating a list of source code files to copy. The operation could provide alternative configuration:

train:
  sourcecode:
    ignore:
      - '@gitignore'  # possible config to use .gitignore rules at this location (quoted because YAML reserves a leading @)
      # additional rules
      - '*.pyc'
      - /data
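
To make the “what to ignore” idea concrete, here is a minimal sketch of applying rules like the ones above to a list of project files. It implements only a small, simplified subset of .gitignore semantics (no negation, no `**`, no trailing-slash handling) and does not expand `@gitignore`. The function names `is_ignored` and `select_sourcecode` are hypothetical and do not reflect a Guild API.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(rel_path, patterns):
    """Return True if a project-relative path matches any ignore pattern.

    Simplified rules (an assumption, not full .gitignore behavior):
    - A pattern starting with "/" is anchored to the project root and
      matches that path or anything beneath it.
    - Any other pattern is matched against each path component.
    """
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.startswith("/"):
            anchored = pattern.lstrip("/")
            # Check the path itself and each of its parent directories.
            candidates = [path, *path.parents]
            if any(fnmatchcase(str(c), anchored) for c in candidates):
                return True
        elif any(fnmatchcase(part, pattern) for part in path.parts):
            return True
    return False


def select_sourcecode(files, patterns):
    """Keep the files that no ignore pattern excludes."""
    return [f for f in files if not is_ignored(f, patterns)]
```

With the patterns `*.pyc` and `/data`, a file such as `pkg/util.pyc` and anything under the top-level `data` directory would be skipped, while a nested `pkg/data` directory would still be copied because `/data` is anchored to the project root.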

I don’t have the knowledge to assess this proposal from a technical standpoint, but I think it’s a good idea. It would not adversely impact anything I do, and it would make source code placement more intuitive.

I’d welcome this change. An example from Python where this is currently cumbersome: if your project loads data from a local file, os.path is currently set to the project root, where by default nothing is available. I’m currently working around this by defining separate resources.

No problem for my workflow.