Logging scalars fails in a Guild run when tuning with ray[tune]

In my project I automatically tune my pytorch-lightning models using ray and then automatically apply the tuned model. The logger that ray[tune] uses is a SummaryWriter from the tensorboardX package. I also use a tensorboardX SummaryWriter for logging other things in my project. My own logging works fine under Guild, but for some reason Guild fails on the add_scalar() calls made from the tune library.
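For context, the tuning side looks roughly like this (a simplified sketch; my real trainable fits a pytorch-lightning model):

from ray import tune

def trainable(config):
    # Simplified stand-in; my real trainable fits a pytorch-lightning model.
    for step in range(10):
        # Each report is written out by tune's tensorboardX logger via
        # SummaryWriter.add_scalar(..., global_step=step) -- the call that fails.
        tune.report(loss=config["lr"] * step)

tune.run(trainable, config={"lr": tune.grid_search([0.01, 0.1])})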

The trace:

Traceback (most recent call last):
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 594, in _process_trial
    decision = self._process_trial_result(trial, result)
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 666, in _process_trial_result
    self._callbacks.on_trial_result(
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/ray/tune/callback.py", line 192, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/ray/tune/logger.py", line 393, in on_trial_result
    self.log_trial_result(iteration, trial, result)
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/ray/tune/logger.py", line 631, in log_trial_result
    self._trial_writer[trial].add_scalar(
  File "/home/davina/miniconda3/envs/ap/lib/python3.8/site-packages/guild/python_util.py", line 239, in wrapper
    cb(wrapped_bound, *args, **kw)
TypeError: _handle_scalar() got an unexpected keyword argument 'global_step'

The failing line from tune, in full, is self._trial_writer[trial].add_scalar(full_attr, value, global_step=step).

In my own project I have the following line: logger.add_scalar(f"{prefix}/{tag}", scalar_value, global_step, walltime), and this does not fail.

So I went into the ray.tune library, changed the call to self._trial_writer[trial].add_scalar(full_attr, value, step) so that the step is passed positionally, and reran it. The failure went away.

I dug into the Guild source on GitHub, and it looks like _handle_scalar() expects a keyword argument named step rather than global_step.
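To make sure I understood, I reduced the wrapping behavior to a toy example (just a sketch of the mechanics, not Guild's actual code):

def _handle_scalar(tag, value, step=None):
    # The callback names its third argument "step".
    print(f"scalar: {tag}={value} @ step {step}")

def wrapped_add_scalar(*args, **kw):
    # The wrapper forwards the caller's args and kwargs to the callback verbatim.
    _handle_scalar(*args, **kw)

wrapped_add_scalar("val/loss", 0.42, 100)              # positional step: works
wrapped_add_scalar("val/loss", 0.42, global_step=100)  # keyword step: raises
# TypeError: _handle_scalar() got an unexpected keyword argument 'global_step'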

I originally needed help with this, but while writing it up I ended up figuring out the answer. Looks like there’s a potential bug here?

Thanks for the report - and my apologies for the late reply!

Taking a look at this now.

In the meantime, I suggest disabling Guild’s output scalar support if you’re not using it. Generally, if you’re logging directly with tensorboardX or another TF summary logger, you can turn off output scalars, as there’s no point in Guild doing the extra work of scanning script output for summaries.

op:
  output-scalars: off

If you have a lot of operations that you need to configure, you can use the operation-defaults attribute of a model:

- model: ''  # or use the model name
  operation-defaults:
    output-scalars: off
  operations:
    op1: {}
    op2: {}

I’ll update here with my findings on the behavior you’re seeing (a bug in Guild somewhere no doubt, just not sure what yet).

Hi, no problem.

Ah, I think I had turned off output scalars for my base operation, had a bunch of other operations, and assumed the setting would somehow propagate to them. I’ll add it to all my operations.

Yeah, I think that in your _handle_scalar() method you’ve named the argument step, but tensorboardX calls it global_step. So when a call into tensorboardX passes the step by keyword (e.g., global_step=step), the wrapper complains, because _handle_scalar() only accepts step, and we get TypeError: _handle_scalar() got an unexpected keyword argument 'global_step'. That’s my guess, though I may be diagnosing something irrelevant haha
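If that’s right, I’d guess the fix is for the callback to accept the same parameter names as tensorboardX’s add_scalar() — something like this (again just a sketch, not a patch against the real code):

def _handle_scalar(tag, value, global_step=None, walltime=None):
    # Mirrors tensorboardX's add_scalar(tag, scalar_value, global_step, walltime),
    # so both positional and keyword call styles are accepted.
    print(f"scalar: {tag}={value} @ step {global_step}")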

Your assessment is right! I’m cleaning these up now with some tests.

In retrospect this is unrelated to output scalars so please ignore my earlier examples.

If you have a plugins attribute defined for the operation in question, you should be able to work around the error by removing that line or setting it to an empty list. Guild may be auto-enabling something there, so setting this explicitly may avoid the problem:

op:
  plugins: []

I’ll update here when a fix is ready.