Debugging and profiling guild

I sometimes see guild taking a long time to start a run compared to just running the command that I get from --print-cmd. I realise this is because guild has to resolve dependencies etc., but I would like to understand whether there is an easy way to debug, and especially to profile, which steps or operations in the guild command are expensive.

I am aware of the guild --debug flag, but in my particular case it doesn’t provide much info about what is taking a long time.

Run your command with the PROFILE=1 env var set, like this:

PROFILE=1 guild run ...

You’ll get a couple of Python profile stats written.

A nice way to view these files is with SnakeViz. Guild prints the instructions for running snakeviz when you run with the profile flag.
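If you’d rather get a quick text summary before opening SnakeViz, the stats files can also be read with Python’s standard pstats module. This is only a minimal sketch: the helper name slowest_calls is made up for illustration, and the path you pass in is whatever stats file Guild reports writing (the name in the usage comment is a placeholder, not a real Guild path).

```python
import pstats


def slowest_calls(stats_path, limit=10):
    """Return (cumulative_seconds, "file:line(function)") rows, costliest first.

    stats_path is a profile stats file, e.g. one written by a PROFILE=1 run.
    """
    stats = pstats.Stats(stats_path)
    rows = []
    # stats.stats maps (file, line, function) to
    # (primitive calls, total calls, own time, cumulative time, callers).
    for (filename, lineno, func), (_cc, _nc, _tt, cumtime, _callers) in stats.stats.items():
        rows.append((cumtime, f"{filename}:{lineno}({func})"))
    rows.sort(reverse=True)
    return rows[:limit]


# Hypothetical usage; replace the path with the one Guild prints:
# for seconds, where in slowest_calls("path/to/profile.stats"):
#     print(f"{seconds:8.3f}s  {where}")
```

Sorting by cumulative time tends to surface the high-level step that is slow (for example, a file scan) rather than the many small calls underneath it.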

If you need help interpreting anything just attach both stat files and I’ll take a look.

The shorter you can make your code the better. Otherwise the startup cost/time will be overshadowed by the actual run time.

That’s pretty sweet!

I’ve profiled the run that takes a long time. You can find the results here.

As you can see it takes a looong time. This happened after I did a big refactoring of my guild file into multiple guild files using inheritance.

Is the above happening because it is looking for source code in the root dir?

It looks like you have a directory with a lot of files - over 1M. Guild is examining those files to see if they’re candidates for source code copy. By default Guild only looks at, I think, around 100 files unless you’ve configured the sourcecode attr for the operation.

You can see what’s going on by running:

guild run <op> --test-sourcecode

This will still take all that time, but you’ll see where the files are.
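If you just want to know which top-level directories hold most of the files, a small script can count them independently of Guild. This is a rough sketch and does not reproduce Guild’s own selection logic; the function name count_files_by_top_dir is made up for illustration.

```python
import os


def count_files_by_top_dir(root="."):
    """Return {top-level entry: number of files beneath it}, largest first."""
    counts = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            # Count every file below this directory (os.walk does not
            # follow symlinked directories by default).
            counts[entry.name] = sum(len(files) for _, _, files in os.walk(entry.path))
        else:
            # A plain file in the project root counts as one file.
            counts[entry.name] = 1
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# Usage, run from the project root:
# for name, n in count_files_by_top_dir(".").items():
#     print(f"{n:>10}  {name}")
```

Directories near the top of that list are the likely candidates for an exclude entry.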

You can remove a directory from consideration (Guild won’t scan it) this way:

op:
  sourcecode:
    - exclude:
        dir: <dir containing lots of files>

So I have the sourcecode attribute specified, but I guess not in the right way.

My current folder structure looks like this:

training/
scripts/
guild/
    flags/
        classification.yml
        common.yml
        segmentation.yml
    base_model.yml
    classification_model.yml
    segmentation_model.yml
    utils.yml
guild.yml

My main guild.yml looks like this:

- include: guild/segmentation_model.yml
- include: guild/classification_model.yml

These two guild files in turn look like this:

guild/segmentation_model.yml
-----------------------------
- include:
    - base_model.yml
    - utils.yml
    - flags/segmentation.yml
    - flags/common.yml

- model: segmentation_model
  sourcecode:
    - scripts
    - training
    - guild.yml
  extends:
    - base_model
    - utils
  operations:
    convert_to_onnx:
      flags:
        $include: onnx_flags_segmentation
    train:
      flags:
        batch_size: 1
        $include:
          - segmentation_flags
          - train_flags
          - common_flags
    test:
      flags:
        $include:
          - common_flags
          - test_flags

guild/classification_model.yml
-------------------------------
- include:
    - base_model.yml
    - utils.yml
    - flags/classification.yml
    - flags/common.yml

- model: classification_model
  sourcecode:
    - scripts
    - training
    - guild.yml
  extends:
    - base_model
    - utils
  operations:
    train:
      flags:
        batch_size: 0
        $include:
          - classification_flags
          - train_flags
          - common_flags
    convert_to_onnx:
      flags:
        $include: onnx_flags_classification

The base_model.yml looks like this:

base_model.yml
--------------------------------
- config: base_model
  sourcecode:
    select:
      - scripts
      - training
      - guild.yml
  operations:
    train:
      main: scripts/training/train_model --input_database ...
      requires: prepared_data
    test:
      main: scripts/training/test_model --input_database ...
      requires:
        - operation: train
        - prepared_data
  resources:
    prepared_data:
      sources: ...

The command

guild run classification_model:train

is what takes a long time. The interesting thing is that the sourcecode directory in the run directory only contains the source code that I have specified.

$ ls ~/.../.guild/runs/5d239a67d97d4bd4952e2b1cc2b10083/.guild/sourcecode/
guild.yml  scripts  training

It’s the scanning/testing of a large number of files that’s taking time.

What does this command reveal?

guild run classification_model:train --test-sourcecode

It scans through the entire directory:

training/
data/
3rd_party_lib/
scripts/
guild/
guild.yml

So it also scans data and 3rd_party_lib, which are the heavy folders.

Refer to the example I provided above. You need to explicitly exclude any directories containing large numbers of files, unless you want those scanned for consideration as source code files. That scanning is what’s taking the time, and the exclude snippet above will address it.

I see - thank you! It works now :)