New feature, models

I have been using guildai for some time now. I have developed a few ML models, all of which accept already preprocessed data: X_train, y_train, X_test, y_test. The random forest, lightgbm and xgboost operations then accept this same data and fine-tune the model.

I suppose most people (90%) use the same pipeline, so I was thinking it might be good to set up some kind of template for these models. We all use the same flags when we use a random forest model; only the search space differs. So I was thinking it would be good to incorporate something like blueprints or templates for the most popular models. For example, here is my random forest model:

- model: random-forest
  extends: meta-model
  description: Random forest model
  operations:
    train:
      description: Trainer for random forest
      main: trademl.modeling.train_rf  # Python module when running the operation
      requires: op-prepare
      sourcecode:
        - include: '*.py'
      needed: no
      flags-import: all
      flags:
        num_threads:
          arg_name: num_threads
          description: Number of threads to use in mlfinlab multithread function
          min: 1
          max: 32
        sample_weights_type:
          description: Sample weights to use in training
          arg_name: sample_weights_type
          type: string
          default: 'returns'
          choices: [returns,time_decay,none]
        cv_type:
          description: type of cv
          arg_name: cv_type
          type: string
          default: 'purged_kfold'
          choices: ['purged_kfold']
        cv_number:
          description: Number of CV folds to use in CV
          arg_name: cv_number
          min: 1
          max: 20
        max_depth:
          description: Maximum depth for the tree in random forest algorithm
          arg_name: max_depth
          min: 1
          max: 10
        max_features:
          description: Maximum number of features in random forest
          arg_name: max_features
          min: 1
          max: 250
        n_estimators:
          description: Number of estimators (decision trees) in random forest
          arg_name: n_estimators
          min: 1
          max: 10000
        min_weight_fraction_leaf:
          description: TODO
          arg_name: min_weight_fraction_leaf
          min: 0
          max: 1
        class_weight:
          description: sklearn class_weight argument
          arg_name: class_weight
          type: string
          default: 'balanced_subsample'
          choices: ['balanced','balanced_subsample']
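
For readers wondering how flags like these reach the training script: with flags-import, Guild can pick up flags that a Python script defines through argparse. Below is a minimal sketch of what the command-line interface of a module like trademl.modeling.train_rf might look like. The option names mirror the arg_name values above; the function names and the defaults for the numeric flags are assumptions made for illustration, not the original code.

```python
import argparse


def build_parser():
    """Build a parser whose option names match the arg_name values in the Guild file."""
    p = argparse.ArgumentParser(description="Random forest trainer (illustrative sketch)")
    # Numeric defaults below are hypothetical; the Guild file only gives min/max.
    p.add_argument("--num_threads", type=int, default=1)
    p.add_argument("--sample_weights_type", default="returns",
                   choices=["returns", "time_decay", "none"])
    p.add_argument("--cv_type", default="purged_kfold", choices=["purged_kfold"])
    p.add_argument("--cv_number", type=int, default=4)
    p.add_argument("--max_depth", type=int, default=3)
    p.add_argument("--max_features", type=int, default=20)
    p.add_argument("--n_estimators", type=int, default=500)
    p.add_argument("--min_weight_fraction_leaf", type=float, default=0.0)
    p.add_argument("--class_weight", default="balanced_subsample",
                   choices=["balanced", "balanced_subsample"])
    return p


def main(argv=None):
    """Parse the flags and return them as a dict.

    A real module would load the preprocessed X_train/y_train here, fit the
    model, and call main() under an `if __name__ == "__main__":` guard.
    """
    args = build_parser().parse_args(argv)
    return vars(args)
```

Calling main(["--max_depth", "5"]) returns the flag values as a dict, which is roughly what the training code would consume when Guild passes flags on the command line.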

It can save a lot of time when you already have the flags defined. Even better, there could already be reasonable search spaces for each method. In the end, there could be a pipeline that combines all the models and does some kind of AutoML. Since the steps are almost the same, maybe it would be good to have one template guild file with the flags defined.

I know this is not the main goal of guildAI, but I was thinking it would be great if there were one big guildai file with many models and some default values. That would save developers a lot of time.

Guild supports this scenario via inheritance. To share a base model or configuration, you can include the source file or inherit from an installed package.

Inheritance via project-relative includes

To use includes:

# guild.yml

- include: guild-random-forest.yml

- model: my-random-forest
  extends: random-forest-base
  params:
    prepare-main: prepare
    train-main: train

# guild-random-forest.yml

- config: random-forest-base
  operations:
    prepare-data:
      main: '{{ prepare-main }}'
    train:
      main: '{{ train-main }}'
      requires: prepared-data
      flags: ...
  resources:
    prepared-data:
      - operation: prepare-data

This file guild-random-forest.yml is essentially the model definition you’re talking about. You can define an entire life cycle for whatever you want to easily share. All of that will be included in the project via the include top-level object (acting as a preprocessing directive in this case).
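
To make the role of params concrete: the values under params in my-random-forest fill the {{ ... }} placeholders inherited from random-forest-base. The following is a toy illustration of that kind of substitution, not Guild's actual implementation; the resolve_params helper and the dict layout are hypothetical.

```python
import re


def resolve_params(value, params):
    """Replace each {{ name }} placeholder in value with params[name].

    Placeholders without a matching param are left untouched.
    """
    def repl(match):
        name = match.group(1)
        return str(params.get(name, match.group(0)))

    return re.sub(r"\{\{\s*([\w-]+)\s*\}\}", repl, value)


# Operations as they appear in the shared config (main attributes only).
base_ops = {
    "prepare-data": {"main": "{{ prepare-main }}"},
    "train": {"main": "{{ train-main }}"},
}

# Params supplied by the extending model.
params = {"prepare-main": "prepare", "train-main": "train"}

resolved = {
    op: {key: resolve_params(val, params) for key, val in attrs.items()}
    for op, attrs in base_ops.items()
}
```

Here resolved maps prepare-data to the prepare module and train to the train module, which is the effect the extending model is after.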

To share this file you can symlink to it or specify a relative path include like:

- include: ../templates/random-forest.yml

Inheritance via installed package

The other pattern would be to create an installable package that can be used this way:

# guild.yml in a project directory (e.g. ./my-random-forest-project)

- model: my-random-forest
  extends: gpkg.mislav_templates/random-forest-base
  params:
    prepare-main: prepare
    train-main: train

# guild.yml in a package project (e.g. ./guild-templates)
- package: gpkg.mislav_templates

- config: random-forest-base
  # see above for rest of example

To build and install the package, you’d go to the package project directory (e.g. guild-templates) and run:

guild package && guild install dist/*.whl

Once installed, any project can reference a package object using the package name, à la gpkg.mislav_templates/<model or config>.

Note that the gpkg. namespace is used only as a convention. You can use whatever naming scheme you want. These are standard Python packages so it’s a good idea to use some sort of namespace.

@garret, I know about inheritance (you already explained it to me :)). I was thinking it would be great to include those in the guild package, e.g. a full random forest example, so it would be easier for new people to apply guild to random forest and other models.

Guild can do that, as I mentioned, using packages. However, it’s challenging to find the right package definition that is actually reusable. What seems reusable often isn’t, and you end up with an abstraction that isn’t flexible enough and that no one uses.

Guild packages are completely federated though - you’re free to create them and publish them to PyPI for anyone to use.