Locked runs (preserving run integrity)

Summary

Guild runs are conceptually “read only” once they are completed. There are few reasons to ever change information associated with a run. Editing generated files or run source code, for example, changes the run results. Guild provides no features to safeguard against accidental run modification.

This proposal is under development and awaiting feedback.

Problem

Historically, run mutability has not been a major problem. Users simply avoid modifying runs. The upcoming VS Code extension for Guild, however, makes accidental changes to run files more likely: the extension relies on default VS Code editors to view run files, and these editors typically don’t support a read-only mode.

A more subtle problem exists, however: a run may write to linked directories from upstream runs. This may be accidental, or intentional but done without awareness of the implications.

Proposed Approach

Introduce the concept of a locked run. A locked run has the following characteristics:

  • Run files are read-only
  • Run contains a lock file that lists all other run files with SHA hashes
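As an illustration of the second characteristic, here is a minimal sketch of how a lock file could be generated. The file name `.guild-lock.json`, the JSON layout, and the choice of SHA-256 are assumptions made for this example; the proposal does not specify them.

```python
import hashlib
import json
import os
from pathlib import Path

# Hypothetical lock file name; the proposal does not define one.
LOCK_FILE_NAME = ".guild-lock.json"


def sha256_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(run_dir):
    """Map each run file (relative POSIX path) to its SHA-256 digest.

    The lock file itself is skipped so that it does not list its own hash.
    """
    run_dir = Path(run_dir)
    manifest = {}
    for root, _dirs, files in os.walk(run_dir):
        for name in files:
            path = Path(root) / name
            rel_path = path.relative_to(run_dir).as_posix()
            if rel_path == LOCK_FILE_NAME:
                continue
            manifest[rel_path] = sha256_file(path)
    return manifest


def write_lock_file(run_dir):
    """Write the manifest as JSON into the run directory and return its path."""
    lock_path = Path(run_dir) / LOCK_FILE_NAME
    lock_path.write_text(json.dumps(build_manifest(run_dir), indent=2, sort_keys=True))
    return lock_path
```

Storing relative paths keeps the lock file valid if the run directory is moved or exported.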

A locked run can be verified using a Guild command (e.g. guild runs verify <run id>).
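Verification could work roughly as sketched below. The function name `verify_run` and the manifest shape (a mapping of relative path to SHA-256 hex digest) are assumptions for illustration, not part of Guild.

```python
import hashlib
from pathlib import Path


def verify_run(run_dir, manifest):
    """Check run files against a {relative_path: sha256_hex} manifest.

    Returns a list of (problem, relative_path) tuples; an empty list
    means every listed file is present and unchanged.
    """
    run_dir = Path(run_dir)
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        path = run_dir / rel_path
        if not path.is_file():
            problems.append(("missing", rel_path))
            continue
        # Reads the whole file into memory, which is acceptable for a sketch.
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(("modified", rel_path))
    return problems
```

This sketch only checks files named in the manifest. Detecting files added after locking would need an additional walk of the run directory.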

A locked run can be unlocked using a Guild command (e.g. guild runs unlock <run id>). This has the effect of making run files writable and deleting the lock file. An unlocked run cannot be verified (i.e. the verify command exits with a warning or error).

An unlocked run may be locked using a Guild command (e.g. guild runs lock <run id>). This has the effect of making run files read-only and re-generating the lock file used for verification.
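The permission side of locking and unlocking can be sketched with the standard library alone. The function name `set_run_writable` is hypothetical, and the sketch assumes POSIX-style permission bits; on Windows, `os.chmod` only toggles the read-only flag.

```python
import os
import stat

# Write permission for owner, group, and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def set_run_writable(run_dir, writable):
    """Make a run directory tree writable (unlock) or read-only (lock).

    Symbolic links are skipped: changing their mode would change the
    link target, which may belong to an upstream run.
    """
    # Collect all paths first so the walk is not affected by mode changes.
    paths = [run_dir]
    for root, dirs, files in os.walk(run_dir):
        paths.extend(os.path.join(root, name) for name in dirs + files)
    for path in paths:
        if os.path.islink(path):
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if writable:
            mode |= stat.S_IWUSR   # restore owner write access only
        else:
            mode &= ~WRITE_BITS    # clear write access for everyone
        os.chmod(path, mode)
```

Clearing the write bits on directories as well as files also prevents files from being added to or removed from a locked run.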

There are some exceptions that must be considered:

  • Mutable run attributes: label, tags, comments
  • Run restarts

Mutable run attributes modified after the run is completed do not affect the integrity of the run.

Run restarts require full write access to the run directory. A restarted run is effectively a new run that reuses the original run ID (see True Run Immutability below).

Alternative Approaches

Unverifiable Read Only Runs

We might consider a less costly improvement that does not support run verification. A locked run in this case would merely have read-only files and not contain a lock file. This could be an interim concept of a locked run with full verification being added later.

True Run Immutability

NOTE: This is not an alternative but rather a new feature that should have its own RFC. We include it here to consolidate the discussion/feedback around run integrity.

We might consider making runs strictly immutable. A run restart would create a new run using the contents of the original run directory as its starting point. This is similar to the --proto run command flag where all run files are copied (rather than just source code and run config).
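A copy-based restart might look like the following sketch. The function name `restart_as_new_run`, the flat layout of run directories under a single root, and the use of a random hex string as the new run ID are all assumptions for illustration.

```python
import shutil
import uuid
from pathlib import Path


def restart_as_new_run(original_run_dir, runs_root):
    """Copy an existing run directory into a new run directory.

    The original run is left untouched; the returned directory is the
    starting point for the restarted process.
    """
    new_run_id = uuid.uuid4().hex  # hypothetical ID scheme
    new_run_dir = Path(runs_root) / new_run_id
    # symlinks=True copies links as links instead of duplicating the
    # contents of linked upstream run directories.
    shutil.copytree(Path(original_run_dir), new_run_dir, symlinks=True)
    return new_run_dir
```

Note that `shutil.copytree` copies permission bits by default, so a copy of a read-only run would need to be made writable before the restarted process writes to it.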

The obvious cost of this approach is the time and disk space needed to copy very large files. However, this cost could be mitigated with an --inplace flag (or similar) that causes Guild to use the old restart behavior. And even without --inplace, users are free to delete old runs if disk space is a problem.

The primary use case for restarting a run is to retry a failed run with new source code or configuration. The restart takes advantage of any files (partial or otherwise) that the original run process created. By supporting an in-place restart (the current behavior), we corrupt the original failed run, making it impossible to diagnose or study its problems in the future. A “true run immutability” scheme would preserve the failed run while allowing the new run to take advantage of any generated files.