Organize experiment suite

This is the case of a longer-term project with multiples of:

  • experiments
  • models
  • datasets

Perhaps mixed and matched in different ways. Basically, lots of experimenting.
I'm not sure whether git submodules should be used or just one big repo.

How do you organize ongoing projects with multiple models and experiments? I was thinking of something like the layout below, using site.addsitedir() in each experiment script to include the:

  • project root
  • models directory
  • datasets

Or am I thinking of this wrong?

root/
  Datasets/
  Experiments/
    Experiment_1/
      guild1.yml
      main_script1.py
    Experiment_2/
      guild2.yml
      main_script2.py
    Experiment_3/
      guild3.yml
      main_script3.py
  Models/
    model_A/
      __init__.py
    model_B/
      __init__.py
    model_C/
      __init__.py
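For what it's worth, the site.addsitedir() idea mentioned in the question might look like this at the top of an experiment script. This is a sketch only; the paths are assumptions based on the layout above:

```python
# Hypothetical path setup at the top of Experiments/Experiment_1/main_script1.py.
# Two directory levels up from the script is the project root in this layout.
import os
import site

ROOT = os.path.abspath(
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..")
)

site.addsitedir(ROOT)                            # project root
site.addsitedir(os.path.join(ROOT, "Models"))    # import model_A, model_B, ...
site.addsitedir(os.path.join(ROOT, "Datasets"))  # dataset helpers

# After this, `import model_A` resolves against root/Models.
```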

I’d have to better understand what you mean by experiments in this case. With Guild, an experiment would be to run a script with some set of inputs. That script might train a model by loading its representation, initializing it, and then using a trainer of some sort to update the representation using examples.

There’s no experiment-in-code in that case. Rather, there’s a general purpose training script that takes user input to run a parameterized operation. The experiment is then a record of a set of inputs and the results.

Regarding organizing your code, I’d start with a single script per model-that-you-want-to-train. When I say script, I mean some Python module. This could be anywhere in your source tree. If you want to create a deployable package at some point, that module should be inside a Python package.

root/
  setup.py
  my_ml_python_package/
    __init__.py
    train_model_foo.py
    guild.yml

This is the simplest possible layout I can think of. my_ml_python_package is the Python package for your project. I’d stick with one package until you clearly need more.
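As a sketch of what such a script might contain (the flag names and training logic here are purely illustrative, not from the thread). Guild can expose module-level globals like these as run flags:

```python
# Hypothetical my_ml_python_package/train_model_foo.py.
# Guild can detect module-level globals like these and expose them as flags.
learning_rate = 0.01  # flag: tunable per run
epochs = 10           # flag: tunable per run

def train(lr, n_epochs):
    # Stand-in for real training; returns a fake final loss.
    loss = 1.0
    for _ in range(n_epochs):
        loss *= (1.0 - lr)
    return loss

if __name__ == "__main__":
    final = train(learning_rate, epochs)
    # Guild can capture scalars from output lines like this one.
    print(f"loss: {final:.6f}")
```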

A good reason to use multiple packages would be that one part of your code base is released much more frequently than another, and you want to avoid releasing the stable part: it's relied on by other packages, so you minimize its revisions. But I'd only make this move once you've actually seen and felt this problem, not before.

Regarding data sets and models, you can start by putting literally every bit of code inside train_model_foo.py. This might seem like a terrible idea, but consider it. When you add the next script — e.g. train_model_bar.py — you're going to spot code that you reuse. It's quite obvious. At that point you can reason about what you're reusing and move it to a sensible third module.

In my experience, I see three general types of modules emerge from this method:

  • Model construction
  • Data set loading and preparation
  • Utilities

So when you move to your second training script — e.g. bar — your project layout might naturally evolve to look like this:

root/
  setup.py
  my_ml_python_package/
    __init__.py
    train_model_foo.py
    train_model_bar.py
    models.py
    data.py
    util.py
    guild.yml
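To make the factoring concrete, the shared modules might end up holding something like this (all names and logic are illustrative stand-ins):

```python
# models.py -- model construction shared by train_model_foo.py and train_model_bar.py
def build_foo(hidden_size=32):
    # Stand-in for real model construction.
    return {"name": "foo", "hidden_size": hidden_size}

# data.py -- data set loading and preparation
def load_examples(limit=None):
    # Stand-in for real data loading.
    examples = [(x, 2 * x) for x in range(100)]
    return examples[:limit] if limit is not None else examples

# util.py -- everything that fits nowhere else
def batched(items, size):
    # Split a list into consecutive chunks of at most `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]
```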

In time, you might find that models.py becomes incoherent and decide to break it up into separate modules. E.g. model_foo.py and model_bar.py. You could create a models sub-package like this:

root/
  my_ml_python_package/
    __init__.py
    models/
      __init__.py
      bar.py
      foo.py
    ...

Personally, I’d opt for this:

root/
  my_ml_python_package/
    __init__.py
    model_bar.py
    model_foo.py
    ...

Either structure is fine. I prefer the latter because I find it simpler. I know plenty of people who strongly argue for nesting. This is Python though, so flat is better than nested :wink:

A good reason to factor large modules into separate modules is when authors are routinely resolving commit conflicts. That’s a sign to peel a module apart to help authors get out of one another’s way. Lines of code is another reason, but I think less compelling. I prefer to use “coherency”, “grokkability”, “maintainability” as standards for these decisions.

To summarize:

  • Start with a single top-level Python package
  • Drive the creation of your Python modules from the scripts that you execute to do things (i.e. the main modules in your Guild operations)
  • Factor code into separate modules in sensible ways as needed based on obvious, defensible pain points (examples include code reuse, maintainability, and routine commit conflicts)
  • At all points avoid speculating — base the evolution of your source code structure on actually felt pain points

Okay, try this:

exp1/
  model_combine1.py
  model_A
  model_A

exp2/
  model_combine2.py
  model_A
  model_B

exp3/
  model_combine3.py
  model_B
  model_C

exp4/
  model_combine4.py
  model_C
  model_A

We are reusing models, or let's say 'sub-models', in different composite models, mixing and matching.

As for nesting, it's already better for us to have the individual models broken up into several source files. Transformer models, for example, are usually split up into encoder, decoder, self-attention head, and so on, i.e.:

model_A
  __init__.py
  A_encoder.py
  A_decoder.py
  A_encoder_layer.py
  A_decoder_layer.py

Innovation, and hence code modification, can occur both in the larger composite model (model_combine<x>.py) and in the member sub-models (e.g. A_encoder_layer.py).
This is all in the research phase.
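A minimal sketch of the composite idea, with stand-in classes playing the role of code that would live in model_A/ and model_B/ (everything here is hypothetical):

```python
# Sketch of a model_combine<x>.py composing two sub-models. EncoderA and
# DecoderB stand in for classes that would be imported from model_A and model_B.
class EncoderA:
    def __call__(self, x):
        return [v * 2 for v in x]  # stand-in for model_A's encoder

class DecoderB:
    def __call__(self, h):
        return sum(h)  # stand-in for model_B's decoder

class CombinedModel:
    """Composite model mixing sub-models from different packages."""
    def __init__(self):
        self.encoder = EncoderA()
        self.decoder = DecoderB()

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CombinedModel()
result = model.forward([1, 2, 3])
```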

I’m still trying to wrap my head around Guild saving source code on every run. How does the workflow work? Is this a kind of replacement for git? For rapid prototyping?
If I edit source code in place, one thing I can do is symlink auxiliary source code (say, sub-modules) and then swap symlinks in and out. In-place code like model_combine.py would have versions saved by Guild for every run.
I would want an easy way to access all the versioned code for quick diffing.

As an aside, how about symlinking source code?

Rethinking this:
Your summary points - you're right, this is great advice.
A better description of the issue is using resources from outside the project:

  1. resources unique to the project, in the (super?)root directory
  2. shared resources, which may or may not be in root
    2.1 DVC sets a precedent; it uses symlinks for datasets
  3. when to package

Options for organising:

  1. set the PYTHONPATH env var
  2. hard-coded path modification: sys.path.append, site.addsitedir, etc.
  3. symlinks
  4. an installed Python package, possibly in a venv
    4.1 when to use pip install -e and when not?
  5. etc.

I will take your advice and try things out and see what happens

Guild copies source code for two reasons:

  • The experiment has a copy of the code that is run when the experiment starts
  • Changes to project code during a run will not impact the run

The second point is quite important. As many operations in ML are long-running, it’s possible, if not likely, that an innocent change to project code (i.e. the code sitting in your project directory) is accidentally loaded by a running operation if the operation loads from the project location. It must not. The code must be copied to the run directory and loaded from that location.

Guild goes out of its way to prevent the project directory from ending up in the Python path for a run. If the run directory does not contain the required source code, the run fails. That’s by design.

The alternative would be to lock every source code file in your project during a run to make sure a change doesn’t accidentally break a run. Imagine that.

It’s also important to have an accurate copy of the source code that’s actually run. Without copying, there’s no way you can debug issues associated with a run.

There are some experiment management tools that restrict you to using a git commit. This is safe because you can trace the run’s source code back to a precise set of files. But this is rather absurd IMO from a workflow standpoint, as your git repo is forced to record a commit for every experiment. I would flat out refuse to use such a system because a) it’s a pain, b) it needlessly clutters your source repo commit history, and c) it’s error prone depending on how the tool is implemented.

If you’re using source code from outside the project, I recommend a few approaches:

  • If the code doesn’t change routinely, consider deploying it within your Python environment (virtual or otherwise; you can get details about the current environment by running guild check). This will let you version the code and ensure it doesn’t change without detection. Guild automatically generates a pip freeze record for each run, which you can use to detect any changes in library versions across runs.

  • Use symlinks or git submodules within your active project to the shared code. This will result in the shared code being copied as project code.

  • Use a parent directory location as root in the sourcecode spec for an operation:

op:
  sourcecode:
    root: ..

This assumes your active project root is a peer directory of the shared projects. E.g.

/current_project/
  guild.yml
/shared_mod/
  __init__.py
  ...  

It’s important to know exactly what runs with each experiment. I would avoid any trickery with PYTHONPATH at operation runtime. Instead, rely exclusively on source code copies or versioned libraries.

This approach forces you to get sourcecode right, which can take a little effort. If something doesn’t run because you’re missing source code, you need to address that via the sourcecode attribute. Use guild run op --test-sourcecode to preview what will be copied from where, adjusting the sourcecode spec as needed. Obviously, running your code is the key test.

In this process you might consider putting your code into a “run all code paths through super fast” mode that you can use to confirm that the operation runs through as expected. This is also useful for quick “sniff tests”.
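One way to sketch such a mode, assuming an argparse-based script (the --smoke-test flag name and the training loop are illustrative, not from the thread):

```python
# Hypothetical training script with a "run everything fast" switch.
import argparse

def train(epochs, steps_per_epoch):
    steps = 0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            steps += 1  # stand-in for a real training step
    return steps

def main(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--epochs", type=int, default=50)
    p.add_argument("--smoke-test", action="store_true",
                   help="run every code path with tiny settings")
    args = p.parse_args(argv)
    if args.smoke_test:
        # Shrink everything so the full pipeline finishes in seconds.
        args.epochs = 1
        steps_per_epoch = 2
    else:
        steps_per_epoch = 1000
    return train(args.epochs, steps_per_epoch)

if __name__ == "__main__":
    print(main([]))
```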