New feature, models

mislav · September 28, 2020, 1:01pm

I have been using Guild AI for some time now. I have developed a few ML models that all accept the same preprocessed data: X_train, y_train, X_test, y_test. The random forest, LightGBM and XGBoost operations then take this data and fine-tune the model.

I suppose most people (90%) use the same pipeline, so I was thinking it might be good to set up some kind of template for these models. We all use the same flags when we use a random forest model; only the search space differs. So it would be good to incorporate something like blueprints or templates for the most popular models. For example, here is my random forest model:

- model: random-forest
  extends: meta-model
  description: Random forest model
  operations:
    train:
      description: Trainer for random forest  
      main: trademl.modeling.train_rf  # Python module when running the operation
      requires: op-prepare
      sourcecode: 
        - include: '*.py'
      needed: no
      flags-import: all
      flags:
        num_threads:
          arg_name: num_threads
          description: Number of threads to use in mlfinlab multithread function
          min: 1
          max: 32
        sample_weights_type:
          description: Sample weights to use in training
          arg_name: sample_weights_type
          type: string
          default: 'returns'
          choices: [returns,time_decay,none]
        cv_type:
          description: type of cv
          arg_name: cv_type
          type: string
          default: 'purged_kfold'
          choices: ['purged_kfold']
        cv_number:
          description: Number of CV folds to use in CV
          arg_name: cv_number
          min: 1
          max: 20
        max_depth:
          description: Maximum depth for the tree in random forest algorithm
          arg_name: max_depth
          min: 1
          max: 10
        max_features:
          description: Maximum number of features in random forest
          arg_name: max_features
          min: 1
          max: 250
        n_estimators:
          description: Number of estimators (decision trees) in random forest
          arg_name: n_estimators
          min: 1
          max: 10000
        min_weight_fraction_leaf:
          description: TODO
          arg_name: min_weight_fraction_leaf
          min: 0
          max: 1
        class_weight:
          description: sklearn class_weight argument
          arg_name: class_weight
          type: string
          default: 'balanced_subsample'
          choices: ['balanced','balanced_subsample']

It saves a lot of time when the flags are already defined. Even better, there could already be reasonable search spaces for each method. In the end, there could be a pipeline that combines all models and does some kind of AutoML. Since the steps are almost the same, maybe it would be good to have one template Guild file with the flags defined.

I know this is not the main goal of Guild AI, but I think it would be great to have one big Guild file with many models and some default values. That would save developers a lot of time.

garrett · September 28, 2020, 3:22pm

Guild supports this scenario via inheritance. To share a base model or configuration, you can include the source file or inherit from an installed package.

Inheritance via project-relative includes

To use includes:

# guild.yml

- include: guild-random-forest.yml

- model: my-random-forest
  extends: random-forest-base
  params:
    prepare-main: prepare
    train-main: train

# guild-random-forest.yml

- config: random-forest-base
  operations:
    prepare-data:
      main: '{{ prepare-main }}'
    train:
      main: '{{ train-main }}'
      requires: prepared-data
      flags: ...
  resources:
    prepared-data:
      - operation: prepare-data

This file guild-random-forest.yml is essentially the model definition you’re talking about. You can define an entire life cycle for whatever you want to easily share. All of that will be included in the project via the include top-level object (acting as a preprocessing directive in this case).
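As a rough illustration of what the flags section of such a shared file could hold, here is a sketch that reuses three of the flag definitions from the random forest model posted at the top of this thread. The selection of flags is arbitrary and the file has not been run, so treat it as a starting point rather than a recommended configuration.

```yaml
# guild-random-forest.yml
# Illustrative sketch only: flag names, ranges and choices are copied from
# the random forest definition earlier in this thread.

- config: random-forest-base
  operations:
    prepare-data:
      main: '{{ prepare-main }}'   # filled in by the extending model's params
    train:
      main: '{{ train-main }}'     # filled in by the extending model's params
      requires: prepared-data
      flags:
        n_estimators:
          description: Number of estimators (decision trees) in random forest
          min: 1
          max: 10000
        max_depth:
          description: Maximum depth for the tree in random forest algorithm
          min: 1
          max: 10
        class_weight:
          description: sklearn class_weight argument
          type: string
          default: balanced_subsample
          choices: [balanced, balanced_subsample]
  resources:
    prepared-data:
      - operation: prepare-data
```

The module names stay out of the shared file: each project supplies them through params, so the same flag definitions can sit in front of different training scripts.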

To share this file you can symlink to it or specify a relative path include like:

- include: ../templates/random-forest.yml

Inheritance via installed package

The other pattern would be to create an installable package that can be used this way:

# guild.yml in a project directory (e.g. ./my-random-forest-project)

- model: my-random-forest
  extends: gpkg.mislav_templates/random-forest-base
  params:
    prepare-main: prepare
    train-main: train

# guild.yml in a package project (e.g. ./guild-templates)
- package: gpkg.mislav_templates

- config: random-forest-base
  # see above for rest of example

To build and install the package, go to the package project directory (e.g. guild-templates) and run:

guild package && guild install dist/*.whl

Once installed, any project can reference a package object using the package name, e.g. gpkg.mislav_templates/<model or config>.

Note that the gpkg. namespace is used only as a convention. You can use whatever naming scheme you want. These are standard Python packages so it’s a good idea to use some sort of namespace.
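On the point that usually only the search space differs between projects, here is a minimal sketch of a project that extends the shared config and restates a single flag attribute. It assumes that Guild merges inherited attributes, so the extending model only needs to list what it changes; that behavior is worth verifying against your Guild version. The max_depth flag comes from the definition at the top of the thread, while the new upper bound of 20 is a made-up value for illustration.

```yaml
# guild.yml in a project directory
# Illustrative sketch, not a tested configuration.

- model: my-random-forest
  extends: gpkg.mislav_templates/random-forest-base
  params:
    prepare-main: prepare
    train-main: train
  operations:
    train:
      flags:
        max_depth:
          max: 20   # hypothetical override of the inherited upper bound
```

If the merge works as assumed, everything not restated here (descriptions, other flags, the prepared-data resource) would still come from random-forest-base.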

mislav · September 29, 2020, 7:14am

@garrett, I know about inheritance (you already explained it to me :)). I was thinking it would be great to include those in the Guild package itself, e.g. a full random forest example, so it would be easier for new people to apply Guild to random forest and other models.

garrett · September 30, 2020, 4:08pm

Guild can do that, as I mentioned, using packages. However, it’s challenging to find the right package definition that is actually reusable. What seems reusable often isn’t, and you end up with an abstraction that isn’t flexible enough and that no one uses.

Guild packages are completely federated though - you’re free to create them and publish them to PyPI for anyone to use.
