Debugging and profiling guild

copah · December 14, 2020, 5:06pm

I sometimes see guild taking a long time to start a run compared to just running the command that I get from --print-cmd. I realise this is because guild has to resolve dependencies etc., but I would like to understand if there is an easy way to debug, and especially profile, which steps or operations are expensive in the guild command.

I am aware of the guild --debug flag, but in my particular case it doesn’t provide much info about what is taking a long time.

garrett · December 15, 2020, 7:47pm

Run your command with PROFILE=1 env var like this:

PROFILE=1 guild run ...

You’ll get a couple of Python profile stats written.

A nice way to view these files is with SnakeViz. Guild prints the instructions for running snakeviz when you run with the profile flag.

If you need help interpreting anything just attach both stat files and I’ll take a look.

The shorter you can make your code the better. Otherwise the startup cost/time will be overshadowed by the actual run time.

copah · December 15, 2020, 10:33pm

That’s pretty sweet!

I’ve profiled the run that takes a long time. You can find the results here.

As you can see it takes a looong time. This happened after I did a big refactoring of my guild file into multiple guild files using inheritance.

Is the above happening because it is looking for source code in the root dir?

garrett · December 15, 2020, 10:55pm

It looks like you have a directory with a lot of files - over 1M. Guild is examining those files to see if they’re candidates for source code copy. By default Guild only looks at, I think, around 100 files unless you’ve configured the sourcecode attr for the operation.

You can see what’s going on by running:

guild run <op> --test-sourcecode

This will still take just as long, but you’ll see where the files are.
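A quicker way to spot the heavy directories, independent of Guild, is to count files per top-level directory yourself. The sketch below is plain Python and the function name count_files_per_top_dir is made up for illustration. Run it from the project root.

```python
import os


def count_files_per_top_dir(root):
    """Map each top-level directory under root to the number of files below it."""
    counts = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts[entry.name] = sum(
                    len(filenames) for _, _, filenames in os.walk(entry.path)
                )
    return counts


if __name__ == "__main__":
    # Print the largest directories first
    ranked = sorted(count_files_per_top_dir(".").items(), key=lambda kv: -kv[1])
    for name, n in ranked:
        print(f"{n:>10}  {name}")
```

Any directory that shows up with hundreds of thousands of files is a likely candidate for exclusion.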

You can remove a directory from consideration (Guild won’t scan it) this way:

op:
  sourcecode:
    - exclude:
        dir: <dir containing lots of files>

copah · December 15, 2020, 11:30pm

So I have the sourcecode attribute specified, but I guess not in the right way.

My current folder structure looks like this:

training/
scripts/
guild/
    flags/
        classification.yml
        common.yml
        segmentation.yml
    base_model.yml
    classification_model.yml
    segmentation_model.yml
    utils.yml
guild.yml

My main guild.yml looks like this:

- include: guild/segmentation_model.yml
- include: guild/classification_model.yml

These two guild files in turn look like this:

guild/segmentation_model.yml
-----------------------------
- include:
    - base_model.yml
    - utils.yml
    - flags/segmentation.yml
    - flags/common.yml

- model: segmentation_model
  sourcecode:
    - scripts
    - training
    - guild.yml
  extends:
    - base_model
    - utils
  operations:
    convert_to_onnx:
      flags:
        $include: onnx_flags_segmentation
    train:
      flags:
        batch_size: 1
        $include:
          - segmentation_flags
          - train_flags
          - common_flags
    test:
      flags:
        $include:
          - common_flags
          - test_flags

guild/classification_model.yml
-------------------------------
- include:
    - base_model.yml
    - utils.yml
    - flags/classification.yml
    - flags/common.yml

- model: classification_model
  sourcecode:
    - scripts
    - training
    - guild.yml
  extends:
    - base_model
    - utils
  operations:
    train:
      flags:
        batch_size: 0
        $include:
          - classification_flags
          - train_flags
          - common_flags
    convert_to_onnx:
      flags:
        $include: onnx_flags_classification

The base_model.yml looks like this:

base_model.yml
--------------------------------
- config: base_model
  sourcecode:
    select:
      - scripts
      - training
      - guild.yml
  operations:
    train:
      main: scripts/training/train_model --input_database ...
      requires: prepared_data
    test:
      main: scripts/training/test_model --input_database ...
      requires:
        - operation: train
        - prepared_data
  resources:
    prepared_data:
      sources: ...

The command

guild run classification_model:train

is what takes a long time. The interesting thing is that the sourcecode directory in the run directory only contains the source code that I have specified.

$ ls ~/.../.guild/runs/5d239a67d97d4bd4952e2b1cc2b10083/.guild/sourcecode/
guild.yml  scripts  training

garrett · December 15, 2020, 11:34pm

It’s the scanning/testing of a large number of files that’s taking time.

What does this command reveal?

guild run classification_model:train --test-sourcecode

copah · December 15, 2020, 11:35pm

It scans through the entire directory:

training/
data/
3rd_party_lib/
scripts/
guild/
guild.yml

So it also scans data and 3rd_party_lib, which are the heavy folders.

garrett · December 15, 2020, 11:37pm

Refer to the example I provided above. You need to explicitly exclude any directories containing large numbers of files - unless you want those scanned for consideration as source code files. This is what’s taking time. The code snippet above will address that.
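To illustrate why excluding a directory saves time while select rules alone do not, here is a rough sketch of a directory walk that prunes excluded directories before descending into them. This is not Guild's implementation, only the general idea, and scan_candidates is a made-up name. Without pruning, every file under data and 3rd_party_lib still has to be visited and tested, even if none of them ends up being copied.

```python
import os


def scan_candidates(root, exclude_dirs=()):
    """Return relative file paths under root, skipping excluded directories.

    A directory is skipped when its name is in exclude_dirs, at any depth.
    """
    excluded = set(exclude_dirs)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


if __name__ == "__main__":
    # Directory names taken from the listing above
    files = scan_candidates(".", exclude_dirs=["data", "3rd_party_lib"])
    print(f"{len(files)} candidate files")
```

The cost of the walk then depends on the number of files outside the excluded directories rather than on the total in the project.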

copah · December 16, 2020, 12:39am

I see - thank you! It works now
