I’d have to better understand what you mean by experiments in this case. With Guild, an experiment would be to run a script with some set of inputs. That script might train a model by loading its representation, initializing it, and then using a trainer of some sort to update the representation using examples.
There’s no experiment-in-code in that case. Rather, there’s a general purpose training script that takes user input to run a parameterized operation. The experiment is then a record of a set of inputs and the results.
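To make that concrete, here's a rough sketch of what such a script could look like. The flag names and values are made up for illustration; Guild can detect module-level globals like these as flags (depending on how the operation is configured) and records them, along with the script's output, for each run:

```python
# train_model_foo.py -- hypothetical sketch of a general purpose training script.
# The globals below act as the script's inputs; Guild can expose globals like
# these as flags so each run records the exact inputs used for the experiment.

learning_rate = 0.01
epochs = 10
batch_size = 32


def main():
    # Load the model representation, initialize it, and run a trainer over
    # the examples. The details are project specific; here we only log the
    # inputs and a placeholder result so the run captures both.
    print("lr=%s epochs=%s batch_size=%s" % (learning_rate, epochs, batch_size))
    print("loss: 0.123")  # placeholder metric, for illustration only


if __name__ == "__main__":
    main()
```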
Regarding organizing your code, I’d start with a single script per model-that-you-want-to-train. When I say script, I mean some Python module. This could be anywhere in your source tree. If you want to create a deployable package at some point, that module should be inside a Python package.
root/
  setup.py
  my_ml_python_package/
    __init__.py
    train_model_foo.py
  guild.yml
This is the simplest possible layout I can think of. `my_ml_python_package` is the Python package for your project. I'd stick with one package until you clearly need more.
A good reason to use multiple packages is that one part of your code base is released much more frequently than another, and you want to avoid re-releasing the stable part, which other packages rely on and whose revisions you want to minimize. But I'd only make this move once you've actually seen and felt this problem, not before.
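For reference, the `guild.yml` in that first layout could define a single operation that points at the training module. Something roughly like this, as a sketch only; the flag names are invented and the exact syntax for `main` is worth checking against the Guild docs:

```yaml
train-foo:
  description: Train model foo
  main: my_ml_python_package.train_model_foo
  flags:
    learning_rate: 0.01
    epochs: 10
```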
Regarding data sets and models, you can start by putting literally every bit of code inside `train_model_foo.py`. This might seem like a terrible idea, but consider it. When you add the next script (e.g. `train_model_bar.py`), you're going to spot code that you reuse. It's quite obvious. At that point you can reason about what you're reusing and move it to a sensible third module.
In my experience, I see three general types of modules emerge from this method:
- Model construction
- Data set loading and preparation
- Utilities
So when you move to your second training script (e.g. `bar`), your project layout might naturally evolve to look like this:
root/
  setup.py
  my_ml_python_package/
    __init__.py
    train_model_foo.py
    train_model_bar.py
    models.py
    data.py
    util.py
  guild.yml
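To make the reuse concrete, `train_model_bar.py` might end up as little more than glue that wires the shared pieces together. The helper function names below are invented for illustration; yours will differ:

```python
# train_model_bar.py -- hypothetical sketch after factoring out shared code.
# The helper functions are made-up names, used only to show the shape of the
# module once reusable code has moved into data.py, models.py and util.py.

from my_ml_python_package import data
from my_ml_python_package import models
from my_ml_python_package import util

epochs = 10


def main():
    util.init_logging()                 # shared utilities
    train_ds, val_ds = data.load_bar()  # shared data loading and preparation
    model = models.build_bar()          # shared model construction
    model.fit(train_ds, validation_data=val_ds, epochs=epochs)


if __name__ == "__main__":
    main()
```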
In time, you might find that `models.py` becomes incoherent and decide to break it up into separate modules, e.g. `model_foo.py` and `model_bar.py`. You could create a `models` sub-package like this:
root/
  my_ml_python_package/
    __init__.py
    models/
      __init__.py
      bar.py
      foo.py
    ...
Personally, I’d opt for this:
root/
  my_ml_python_package/
    __init__.py
    model_bar.py
    model_foo.py
    ...
Either structure is fine. I prefer the latter because I find it simpler. I know plenty of people who strongly argue for nesting. This is Python though, so flat is better than nested.
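The practical difference mostly shows up in your imports, roughly:

```python
# Nested: models live in a sub-package
from my_ml_python_package.models import foo

# Flat: prefixed module names in the top-level package
from my_ml_python_package import model_foo
```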
A good reason to factor a large module into separate modules is when authors are routinely resolving commit conflicts in it. That's a sign to peel the module apart to help authors stay out of one another's way. Lines of code is another reason, but I think a less compelling one. I prefer to use "coherency", "grokkability", and "maintainability" as the standards for these decisions.
To summarize:
- Start with a single top-level Python package
- Drive the creation of your Python modules from the scripts that you execute to do things (i.e. the `main` modules in your Guild operations)
- Factor code into separate modules in sensible ways as needed based on obvious, defensible pain points (examples include code reuse, maintainability, and routine commit conflicts)
- At all points avoid speculating — base the evolution of your source code structure on actually felt pain points