Prepare data file

mislav · July 28, 2020, 9:55am

I would like to create a data preparation file that is shared by several models.

The documentation on pipelines (Pipelines) shows how to construct a pipeline with the data preparation step in the same model. But I would like to have prepare steps for several models (for example, decision tree and random forest).

What is the recommended way to do this in Guild AI?

The first thing that comes to mind is to have a prepare file and source it from the train script, but I am not sure whether I can call flags from that prepare file normally when running the train operation from Guild.

garrett · July 28, 2020, 11:27am

Do you want to run prepare-data once and use that one prepared data set for all models? Or do you want to run separate prepare operations, each one creating a separate data set for use by each different model type?

The example shows a single data set that’s used by both train and test for a single model type. But these operations could just as easily be train-decision-tree and train-random-forest. The interface would be the same.

I don’t quite follow your last paragraph. From that it sounds like each model has its own prepare logic.

mislav · July 28, 2020, 1:17pm

@garrett, I want one prepare step for multiple models (maybe even all models, I will see).

If I understand correctly, you recommend the following structure:

- model: multie_models
  operations:
    prepare:
      main: ...
      flags: ...
    train_random_forest:
      main: ...
      flags: ...
    train_decision_tree:
      main: ...
      flags: ...

garrett · July 28, 2020, 2:20pm

Yes that’s right. Though you could optionally use model objects to better represent what you’re doing there.

Here’s an example:

github.com

gar1t/guild-examples/blob/master/mislav-pipeline/guild.yml

- config: data-support
  resources:
    prepared-data:
      - operation: prepare-data

- operations:
    prepare-data:
      main: prepare_data
    pipeline:
      steps:
        - prepare-data
        - random-forest:train
        - decision-tree:train

- model: random-forest
  extends: data-support
  operations:
    train:
      main: random_forest
      requires: prepared-data

This file has been truncated.

You can run the pipeline operation to run the operations end-to-end.

If you prefer to use only operations, use this:

- operations:
    prepare-data:
      main: prepare_data

    train-random-forest:
      main: random_forest
      requires: prepared-data

    train-decision-tree:
      main: decision_tree
      requires: prepared-data

    pipeline:
      steps:
        - prepare-data
        - train-random-forest
        - train-decision-tree

  resources:
    prepared-data:
      - operation: prepare-data
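
As a rough illustration of the scripts behind these operations, here is a minimal Python sketch. It assumes that prepare_data writes a file into its run directory and that Guild makes the files from the required prepare-data run available in the training run's working directory. The file name prepared.csv, the n_rows flag, and the function names are all made up for this example. The flag is declared with argparse, one of the ways Guild can import flags from a Python script.

```python
import argparse
import csv
import os


def parse_flags(argv=None):
    # Flags declared with argparse; n_rows is a hypothetical flag.
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_rows", type=int, default=5)
    return parser.parse_args(argv)


def prepare_data(run_dir, n_rows=5):
    """Stand-in for prepare_data.py: write the shared data set into run_dir."""
    path = os.path.join(run_dir, "prepared.csv")  # hypothetical file name
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])
        for i in range(n_rows):
            writer.writerow([i, 2 * i])
    return path


def load_prepared(run_dir):
    """Stand-in for the loading step shared by random_forest.py and
    decision_tree.py: read the file that prepare_data produced."""
    with open(os.path.join(run_dir, "prepared.csv"), newline="") as f:
        return [(int(row["x"]), int(row["y"])) for row in csv.DictReader(f)]
```

In a real project, prepare_data.py would call parse_flags() and prepare_data("."), and each training script would call load_prepared(".") before fitting its model, so both trainers read the same prepared file.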

mislav · July 28, 2020, 3:05pm

This is the first time I have seen config.
Why do I need the config object in the first place? For example, what if I remove config (and extends in the models) and just define operations and model objects?
In the YAML above you define the prepare-data operation twice, in config and then again in operations. I don’t see the reason for this.

garrett · July 28, 2020, 3:22pm

In this case config is used to define shared resources. The prepare-data operation is defined once in both examples.

It’s subtle, but prepared-data (notice the difference in naming convention) is the name of a resource. This is spelled this way so that requires: prepared-data reads better. You can name it whatever you want.
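
To make the distinction concrete, here is a small plain-Python sketch that models the first example as nested dicts and follows a requires entry from resource name to the operation that supplies it. This is only an illustration of the naming relationship, not Guild's actual resolution logic. The function names are hypothetical, and the sketch assumes extends is a single config name.

```python
# The first example's structure, written as Python data for illustration.
guildfile = [
    {
        "config": "data-support",
        "resources": {"prepared-data": [{"operation": "prepare-data"}]},
    },
    {"operations": {"prepare-data": {"main": "prepare_data"}}},
    {
        "model": "random-forest",
        "extends": "data-support",
        "operations": {
            "train": {"main": "random_forest", "requires": "prepared-data"}
        },
    },
]


def resources_by_config(objs):
    """Map each config name to the resources it defines."""
    return {o["config"]: o.get("resources", {}) for o in objs if "config" in o}


def resolve_requires(objs, model_name, op_name):
    """Return the operation that supplies the resource named by requires."""
    configs = resources_by_config(objs)
    model = next(o for o in objs if o.get("model") == model_name)
    # Resources inherited through extends, then any defined on the model.
    resources = dict(configs.get(model.get("extends"), {}))
    resources.update(model.get("resources", {}))
    required = model["operations"][op_name]["requires"]  # a resource name
    return resources[required][0]["operation"]  # an operation name


print(resolve_requires(guildfile, "random-forest", "train"))
```

Here requires holds the resource name (prepared-data), and the resource in turn points at the operation name (prepare-data).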

mislav · July 28, 2020, 3:44pm

How should I run only the prepare step now? guild run prepare-data doesn’t work.

garrett · July 28, 2020, 3:58pm

How does it not work? You can clone the repo above and run the sample to confirm.

mislav · July 28, 2020, 6:20pm

I get:
ERROR: error in C:\Users\Mislav\Documents\GitHub\trademl\guild.yml: invalid value for operation ‘None’ ‘prepare’: expected a string or a mapping

garrett · July 28, 2020, 6:21pm

I’d need to see the Guild file to help here.

mislav · July 28, 2020, 6:23pm

Here is the guild file:

github.com

MislavSag/trademl/blob/master/guild.yml

- config: data-support
  resources:
    prepare-data:
      - operation: prepare-data

- operations:
    prepare-data:
      description: Prepare data for ML model
      main: trademl.modeling.prepare
      flags-import: all
      flags:
        num_threads:
          arg_name: num_threads
          description: Number of threads to use in mlfinlab multhithread function
          min: 1
          max: 32
        structural_break_regime:
          description: Shoud we use structural breaks and if yes, which one
          arg_name: structural_break_regime
          type: string

This file has been truncated.

garrett · July 28, 2020, 6:26pm

Delete this line (the empty prepare: entry under operations, which has no value):

github.com

MislavSag/trademl/blob/master/guild.yml#L85


          max: 1000000
    pipeline:
      steps:
        - prepare-data
        - random-forest:train

- model: random_forest
  extends: data-support
  description: Random forest model
  operations:
    prepare:
    train_random_forest:
      description: Trainer for random forest  
      main: trademl.modeling.train_rf  # Python module when running the operation
      requires: prepare-data
      flags-import: all
      flags:
        num_threads:
          arg_name: num_threads
          description: Number of threads to use in mlfinlab multhithread function
          min: 1

Also, it’s spelled arg-name, not arg_name. Guild should complain about that but it doesn’t.
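
Since Guild does not flag the misspelling, a small check of your own can catch it. The following is a minimal Python sketch, assuming the flag definition has already been loaded into a dict. The attribute set below is a partial list chosen for illustration, not Guild's full schema, and suspicious_flag_attrs is a hypothetical helper.

```python
# Partial list of flag attributes, for illustration only.
KNOWN_FLAG_ATTRS = {
    "description", "type", "default", "required",
    "min", "max", "choices", "arg-name", "arg-skip", "arg-switch",
}


def suspicious_flag_attrs(flag_def):
    """Return (found, suggested) pairs for keys that look like underscore
    spellings of known hyphenated attributes."""
    issues = []
    for key in flag_def:
        fixed = key.replace("_", "-")
        if key not in KNOWN_FLAG_ATTRS and fixed in KNOWN_FLAG_ATTRS:
            issues.append((key, fixed))
    return issues


# The num_threads flag as it is written in the Guild file above.
num_threads = {
    "arg_name": "num_threads",
    "description": "Number of threads to use",
    "min": 1,
    "max": 32,
}

print(suspicious_flag_attrs(num_threads))  # [('arg_name', 'arg-name')]
```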

mislav · July 28, 2020, 6:28pm

Thanks a lot!
