Prepare data file

I would like to make a prepare-data file that is the same for several models.

In the documentation on pipelines (Pipelines) you show how to construct a pipeline with the prepare-data file in the same model. But I would like to have prepare steps for several models (for example, decision tree and random forest).

I would like to know what the recommended way to do this in guildai is.

The first thing that comes to mind is to have a prepare file and source it from the train script, but I am not sure whether I can use flags from that prepare file normally when running the train operation from Guild.

Do you want to run prepare-data once and use that one prepared data set for all models? Or do you want to run separate prepare operations, each one creating a separate data set for use by a different model type?

The example shows a single data set that’s used by both train and test for a single model type. But these operations could just as easily be train-decision-tree and train-random-forest. The interface would be the same.

I don’t quite follow your last paragraph. From that it sounds like each model has its own prepare logic.

@garrett, I want one prepare step for multiple models (maybe even all models, I will see).

If I get it right, you recommend the following structure:

- model: multie_models
  operations:
    prepare:
      main: ...
      flags: ...
    train_random_forest:
      main: ...
      flags: ...
    train_decision_tree:
      main: ...
      flags: ...

Yes that’s right. Though you could optionally use model objects to better represent what you’re doing there.

Here’s an example:

You can run the pipeline operation to run the operations end-to-end.

If you prefer to use only operations, use this:

- operations:
    prepare-data:
      main: prepare_data

    train-random-forest:
      main: random_forest
      requires: prepared-data

    train-decision-tree:
      main: decision_tree
      requires: prepared-data

    pipeline:
      steps:
        - prepare-data
        - train-random-forest
        - train-decision-tree

  resources:
    prepared-data:
      - operation: prepare-data

This is the first time I have seen config.
Why do I need the config object in the first place? For example, what if I remove config (and extends in the models) and just define the operations and models objects?
In the YAML above you define the prepare-data operation twice, in config and then again in operations. I don’t see the reason for this.

In this case config is used to define shared resources. The prepare-data operation is defined once in both examples.

It’s subtle, but prepared-data (notice the difference in naming convention) is the name of a resource. This is spelled this way so that requires: prepared-data reads better. You can name it whatever you want.
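
As a rough sketch of the config-based layout being discussed: the config name shared-resources and the placement of prepare-data in its own operations block are assumptions made for illustration, and the module names are borrowed from the operations-only example above. It is not the linked example itself.

```yaml
# Sketch only: `shared-resources` is a hypothetical config name.
- config: shared-resources
  resources:
    prepared-data:                 # resource name, referenced by `requires`
      - operation: prepare-data    # operation whose run output is reused

# Assumed placement: prepare-data is defined once, outside the models.
- operations:
    prepare-data:
      main: prepare_data

- model: random-forest
  extends: shared-resources        # inherits the prepared-data resource
  operations:
    train:
      main: random_forest
      requires: prepared-data

- model: decision-tree
  extends: shared-resources
  operations:
    train:
      main: decision_tree
      requires: prepared-data
```

In a layout like this, each model picks up the resource through extends. Dropping the config object would mean repeating the resources block in every model that needs the prepared data.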

How should I run only the prepare step now? guild run prepare-data doesn’t work.

How does it not work? You can clone the repo above and run the sample to confirm.

I get:
ERROR: error in C:\Users\Mislav\Documents\GitHub\trademl\guild.yml: invalid value for operation ‘None’ ‘prepare’: expected a string or a mapping

I’d need to see the Guild file to help here.

Here is the guild file:

Delete this line:

Also, it’s spelled arg-name, not arg_name. Guild should complain about that but it doesn’t.
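
For illustration, a flag definition that uses arg-name might look like the sketch below. The flag name n_estimators, its default value, and the argument name are all hypothetical. This is not the poster's actual configuration.

```yaml
- model: random-forest
  operations:
    train:
      main: random_forest
      flags:
        n_estimators:              # hypothetical flag name
          default: 100             # hypothetical default value
          arg-name: n-estimators   # hyphenated key: arg-name, not arg_name
```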

Thanks a lot!
