Use glob patterns for file source select by default with option for regex

Summary

This proposal seeks to simplify the process of specifying file patterns. It addresses the confusion and complexity associated with regular expressions in file patterns used for resource source select specs. We suggest that Guild use glob patterns by default for select with the option of using select-regex as an alternative attribute for specifying regular expression patterns.

This proposal is under development.

Problem

Guild’s select specification for resource sources uses Python regular expressions. For example, the following configuration is used to select files ending with .txt:

op:
  requires:
    - file: .
      select: .+\.txt

We find this syntax burdensome for such a simple goal. The syntax is also different from that used for sourcecode (operation and model attribute) and data-files (package attribute) to match files.

Proposed Approach

We propose a breaking change to re-interpret the select patterns used for resource sources as glob patterns rather than as regular expressions. To support regular expressions, which provide considerably more flexibility, we propose a new attribute select-regex, which may be specified as a mutually exclusive alternative to select.

Under this approach, the example above is changed to:

op:
  requires:
    - file: .
      select: '*.txt'

Note that this value is single quoted because YAML would otherwise parse the leading * as the start of an alias.
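To make the proposed split concrete, here is a minimal Python sketch of how a resource source might be turned into a file matcher. It is not Guild's implementation: the function name `source_matcher` and the dict-shaped `source` argument are assumptions for illustration, as is the choice to match against the whole path string. The glob case relies on the standard library's `fnmatch.translate`, which converts a glob into an equivalent regular expression.

```python
import fnmatch
import re


def source_matcher(source):
    """Return a predicate for paths selected by a resource source.

    `source` is a hypothetical dict form of one `requires` entry.
    """
    has_glob = "select" in source
    has_regex = "select-regex" in source
    if has_glob and has_regex:
        # The proposal treats the two attributes as mutually exclusive.
        raise ValueError("specify either 'select' or 'select-regex', not both")
    if has_regex:
        pattern = re.compile(source["select-regex"])
    elif has_glob:
        # fnmatch.translate converts a glob into a regex string.
        pattern = re.compile(fnmatch.translate(source["select"]))
    else:
        # Assumption: with no pattern, every path is selected.
        return lambda path: True
    return lambda path: pattern.fullmatch(path) is not None


by_glob = source_matcher({"file": ".", "select": "*.txt"})
by_regex = source_matcher({"file": ".", "select-regex": r".+\.txt"})
print(by_glob("notes.txt"), by_regex("notes.txt"))  # True True
print(by_glob("notes.csv"), by_regex("notes.csv"))  # False False
```

The two patterns are close but not identical: the glob `*.txt` also matches a file named `.txt`, while the regular expression `.+\.txt` requires at least one character before the extension.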

Migrating users

As this is a breaking change, we need a migration strategy that gives users an easy path to migrate their configuration without disrupting their work.

We propose a deprecation period that supports current projects but warns users of an upcoming, breaking change.

During the deprecation period, Guild attempts to detect a regular expression and uses the value as such while logging a warning message. The warning message should instruct the user to rename the attribute to select-regex to continue using the pattern without warning.

[WARNING]: resource source 'file' appears to be using a regular
expression in 'select'. Support for regular expressions using 'select' is
deprecated. Use 'select-regex' instead. In Guild 0.8, this value will be
used as a glob pattern.
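The proposal does not say how a regular expression would be detected. One possible heuristic, sketched below in Python, is to treat a value as a regular expression only if it compiles as one and contains tokens that are rare in globs. The names `REGEX_HINTS`, `looks_like_regex`, and `resolve_select`, and the token list itself, are assumptions for illustration.

```python
import logging
import re

log = logging.getLogger("guild")

# Assumption: tokens that are common in regular expressions but rare in
# glob patterns. This list is illustrative, not exhaustive.
REGEX_HINTS = re.compile(r"\.[+*]|\\.|^\^|\$$|[(|)]")


def looks_like_regex(value):
    """Guess whether a `select` value was written as a regex."""
    try:
        re.compile(value)
    except re.error:
        # Globs such as '*.txt' are not valid regular expressions.
        return False
    return REGEX_HINTS.search(value) is not None


def resolve_select(value):
    """Return ("regex" | "glob", value), warning on deprecated use."""
    if looks_like_regex(value):
        log.warning(
            "'select' value %r appears to be a regular expression; "
            "use 'select-regex' instead",
            value,
        )
        return "regex", value
    return "glob", value


print(resolve_select(r".+\.txt"))  # ('regex', '.+\\.txt')
print(resolve_select("*.txt"))     # ('glob', '*.txt')
```

A heuristic like this can still guess wrong. A value such as `data.*` is valid in both syntaxes and would be classified as a regular expression here, which is one reason the warning should tell users how to make their intent explicit.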

Alternative Approaches

Specify a regex using new syntax

Both glob and regular expression syntax could be supported for a single select attribute.

For example, Python designates regular expression values using r'...'. JavaScript supports them as /.../. The JavaScript notation would certainly not be suitable for specifying paths.

Various notations are explored below.

This approach has the advantage of establishing a common syntax for file select expressions that can be used for other settings including those specified as command line options.

JavaScript syntax

select: /^/foo/bar/.+\.txt$/

This is less than ideal for defining values with paths. For example, /foo/ would be interpreted as the regular expression foo rather than as the path /foo/, which is quite different from what it looks like.

Python pseudo-syntax

select: r'^/foo/bar/.+\.txt$'

This is sleight of hand. It looks like a novel string-ish type, but in YAML it’s just the string "r'^/foo/bar/.+\.txt$'".

This syntax falls over when used in a shell:

$ echo r'hello'
rhello

Explicit prefix

select: regex:^/foo/bar/.+\.txt$

While this syntax requires a lengthy prefix, it is clearly denoted.
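Parsing an explicit prefix is straightforward. The following Python sketch assumes the prefix is exactly `regex:` and that any value without it is a glob; the function name `parse_select` is hypothetical.

```python
REGEX_PREFIX = "regex:"


def parse_select(value):
    """Split a select value into (kind, pattern)."""
    if value.startswith(REGEX_PREFIX):
        # Everything after the prefix is the regular expression.
        return "regex", value[len(REGEX_PREFIX):]
    return "glob", value


print(parse_select(r"regex:^/foo/bar/.+\.txt$"))  # ('regex', '^/foo/bar/.+\\.txt$')
print(parse_select("*.txt"))                      # ('glob', '*.txt')
```

Because the value is an ordinary string, the same parsing would work unchanged for values given as command line options.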

Auto-detect glob vs regex

Guild could attempt to detect a glob expression and use the corresponding regular expression automatically.

This approach should be rejected because it introduces implicit behavior that is hard to debug. There are no tools that we are aware of that use this approach.
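The difficulty is that many patterns are valid in both syntaxes but select different files. The short Python example below uses a made-up pattern and file name to show one such case with the standard `re` and `fnmatch` modules.

```python
import fnmatch
import re

# Hypothetical values chosen to show the ambiguity.
pattern = "data.txt"
name = "dataXtxt"

# As a regex, '.' matches any character; as a glob, it is a literal dot.
as_regex = re.fullmatch(pattern, name) is not None
as_glob = fnmatch.fnmatchcase(name, pattern)
print(as_regex, as_glob)  # True False
```

No inspection of the pattern alone can tell which interpretation the user intended.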

New select-glob attribute

Rather than introduce a breaking change, add a new attribute select-glob, which is used with glob expressions. In this case, the example above is changed to:

op:
  requires:
    - file: .
      select-glob: '*.txt'

This is a viable approach but it suffers from three problems:

  • We believe that in the majority of cases, glob patterns are sufficient for selection. The default should correspond to the majority case.

  • The term glob is systems jargon, like regex. The more technical case should be the exception and not the default.

  • Until Guild reaches 1.0, we are not constrained to non-breaking changes.

The advantage of this approach is that it maintains compatibility with existing projects and avoids the need to support a deprecation period.

Hi @garrett!

Several thoughts on this one.

  1. Many people in the field of ML do not come from a programming background, so not everyone is familiar with regex syntax. So the “glob” syntax seems to be a better choice.

  2. When dealing with file paths, I personally would expect default support for “glob” syntax.

  3. I suggest another attribute: use-regex. By default it would be set to True during the deprecation period to maintain backward compatibility. If the attribute is not found in the guild.yml during that period, the warning message would be printed. After that, the default value for use-regex would be False. So the final syntax would be

op:
  requires:
    - file: .
      select: '*.txt'

for regular “glob” syntax and

op:
  requires:
    - file: .
      select: .+\.txt
      use-regex: True

for the one with regex. The suggested approach IMO is better because it does not introduce a new attribute (select-regex) to do essentially the same job as select. Also it would be possible to use “glob” syntax starting from the next update by setting use-regex to False explicitly.
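The changing default for use-regex described in point 3 can be sketched in Python as follows. The function name `select_kind`, the dict-shaped `source`, and the `deprecation_period` switch are assumptions for illustration, not part of Guild.

```python
import warnings


def select_kind(source, deprecation_period=True):
    """Return "regex" or "glob" for a hypothetical source dict."""
    if "use-regex" in source:
        # An explicit setting always wins and produces no warning.
        return "regex" if source["use-regex"] else "glob"
    if deprecation_period:
        warnings.warn(
            "'use-regex' is not set; 'select' is treated as a regular "
            "expression for now but will be treated as a glob later",
            DeprecationWarning,
        )
        return "regex"
    # After the deprecation period the default flips to glob.
    return "glob"


print(select_kind({"select": "*.txt", "use-regex": False}))         # glob
print(select_kind({"select": r".+\.txt", "use-regex": True}))       # regex
print(select_kind({"select": "*.txt"}, deprecation_period=False))   # glob
```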

P.S. While I was aware of the existence of “glob”, I have never actually heard anyone use the term. The term used was “wildcard matching”.

Hi @igrinis - thanks for the great feedback and welcome!

The advantage of using select-regex is that it accomplishes the same job with a single attribute, rather than having to type two attributes.

I wonder though about even needing this. I missed one option in the proposal, which is to introduce a syntax that can be used to denote a regular expression.

Python, for example, uses r"..." and JavaScript lets you designate a regex with /.../.

I’ll update the proposal with another alternative approach.

Simple and consistent syntax is, IMO, certainly worth an additional line. I don’t mind adding another line if it contributes to clarity and later saves me a few minutes of searching and the overhead of task switching. What I like in Guild is the concept of an experiment manifest. And I expect this manifest to be exact, univocal and as simple as possible. Having several similar options reduces clarity and increases entropy.

upd:
I like the regex prefix. Not the most elegant solution, but does the job.

I’d personally much prefer glob patterns, which are used much more often in filename selection/completion than regexes are. Glob (I think) has a narrower set of functionality, but should be more than sufficient for filtering and selecting files.

Hi @jethro and welcome! Thanks for your input. I agree, globs are much easier to write. (Maybe we call these wildcard patterns; as @igrinis observes, the term “glob” is a bit technical and not many people have heard of it!)

I’m not sure why this hasn’t occurred to me, but Guild uses a very flexible model for selecting and filtering source code files. Why aren’t all these schemes unified? Unless there’s a good reason (I can’t offhand think of one but there may be) I think they should all work the same.

E.g. however regex support is added back to resource source selection, it should also be available for source code and other file selection specs in Guild.