Tensorboard logging twice + is slow

davzaman · October 20, 2020, 5:55am

I have an operation, let’s call it a, that is kind of the “base operation” that I could run a number of ways with different flags. Then I have another operation b that has one step that runs operation a with the proper flags.

What’s strange to me is that I seem to get multiples of everything I logged with tensorboard. Additionally, guild view is pretty slow to open my runs, and viewing in tensorboard from there is much slower (takes 10+ s). I saw someone had an issue with symlinks, but I don’t think that’s the issue here since I don’t have any set up. I feel like these issues are probably linked, considering I wasn’t having this problem before. If I run just 1 configuration I end up with:

Not sure how to go about fixing this, I probably did something wrong.

garrett · October 20, 2020, 2:41pm

The duplication in results in TensorBoard is by design. Steps (pipeline) ops inherit the scalar results of their child runs. To avoid this, use a filter to limit the runs that you’re viewing in TensorBoard. E.g. to view just the ap results (which include imputer) run:

guild tensorboard --operation ap

Re slow performance in View, you’re right this is not likely an issue with symlinks (you’d have seen poor performance with guild tensorboard as well in that case – though this has been fixed).

Guild View is missing some optimizations, which cause it to slow down when viewing more and more runs. When running View in this case, you can work around this by limiting the runs you show. Use filters, e.g.

But to address the issue with TensorBoard performance, I recommend just running guild tensorboard and not launching via View.

View is due for a series of enhancements, including performance optimization, but until then these are the workarounds I recommend.

jan_r · November 25, 2020, 12:06pm

I face the same issue but I am unsure how to resolve it. I have an operation called start which I run when I start an experiment. For every run, there are two entries in Tensorboard. How can I filter out one of them?

guild tensorboard --operation start doesn’t work.

Do you mind sharing some details where the second entry in Tensorboard is coming from?


teracamo · November 26, 2020, 10:51am

Is it possible that your code is also writing to the tensorboard in a directory under the working directory or a directory you requested in the resources?

jan_r · November 26, 2020, 11:05am

This is the way I log results right now:

from torch.utils.tensorboard import SummaryWriter
self.writer = SummaryWriter()
self.writer.add_scalar('Loss/train', loss_train/(batch_idx+1), global_step)

Does this create the duplicate entry in tensorboard?

teracamo · November 26, 2020, 11:32am

From here it says that by default it writes to ./runs/, so I guess this is most likely what’s happening.

What I recommend is to create a static directory on your machine and write all runs to that directory (of course, you need to clean it yourself from time to time, otherwise it could fill up your hard disk in no time).

And then you change your line to

self.writer = SummaryWriter('/some/static/path/')

Or better still, use an environment variable to specify where to write, so you can track your experiments better:

TENSORBOARD_LOGDIR=/some/dir/ guild run XXX

import os
TENSORBOARD_LOGDIR = os.environ['TENSORBOARD_LOGDIR']
self.writer = SummaryWriter(TENSORBOARD_LOGDIR)
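A slightly more defensive version of the same idea, as a minimal sketch: the helper name resolve_logdir and its fallback path are hypothetical, chosen only for illustration. It reads TENSORBOARD_LOGDIR and falls back to a fixed directory when the variable is unset or empty, so the script still runs outside of Guild.

```python
import os


def resolve_logdir(default="/some/static/path/"):
    """Return the TensorBoard log directory taken from the environment.

    Falls back to `default` (a placeholder path) when TENSORBOARD_LOGDIR
    is unset or empty.
    """
    return os.environ.get("TENSORBOARD_LOGDIR") or default


# Usage inside the training code (requires PyTorch with TensorBoard support):
# from torch.utils.tensorboard import SummaryWriter
# writer = SummaryWriter(resolve_logdir())
```

Using os.environ.get avoids a KeyError when the variable was not set for a particular run.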

jan_r · November 26, 2020, 11:48am

Ok, since this static logging directory is not inside the run folder, it won’t be shown in tensorboard, right?

teracamo · November 26, 2020, 11:53am

Correct, as long as it is not under guild’s home directory (usually ~/.guild).

jan_r · November 26, 2020, 1:48pm

Ok, so the only purpose of logging to the static dir is to trigger guildai?
This seems a bit hacky.

teracamo · November 26, 2020, 2:19pm

Actually the purpose is to NOT trigger guild’s tensorboard so you don’t show duplicate items in it.

Generally, there are things you want guild to log: you declare those as flags and scalars in your .yml file so that they show up in tensorboard. Then there are more complicated things that guild doesn’t support yet, like histograms of network parameters: you write those to another tensorboard logdir that you manage yourself.
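As a rough sketch of the first half of that split: this assumes Guild's default output-scalar capture, which picks up lines of the form `key: value` printed by the script and treats a `step` key as the step. Check the Guild docs for your version before relying on it. The function name print_scalars is hypothetical.

```python
def print_scalars(step, **scalars):
    """Print a step followed by scalars as `key: value` lines.

    Assumption: Guild's default output-scalar pattern captures these
    lines from the script's standard output.
    """
    lines = [f"step: {step}"]
    lines += [f"{name}: {value:.6f}" for name, value in scalars.items()]
    print("\n".join(lines))
    return lines


# Example: report the training loss for step 3.
print_scalars(3, loss=0.5)
```

Anything that does not fit this shape, such as histograms, would still go through your own SummaryWriter.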

If you want to write your own data to the same tensorboard events that guild manages, you can do so by setting the logdir to the .guild folder in the run sub-directory. You can do that with:

self.writer = SummaryWriter('./.guild/')

You can also learn more about guild’s run file hierarchy by browsing the run sub-directory; try guild ls and explore it. Note that under the run sub-directory there’s a hidden directory, .guild, that holds the source code, the tensorboard events, the environment variables, etc.

garrett · November 27, 2020, 4:30pm

Just to add a bit to the excellent info @teracamo provides…

TensorBoard doesn’t really know about “runs” in the Scalars plugin. It enumerates unique directories that contain TF event files under its log directory. It calls them runs in the UI, but it has no idea what a run is. In fact, it’s quite common to log events under separate subdirectories for a run to help organize the layout in TB. E.g. you’ll see “train” and “validate” or “eval” subdirs used to separate scalars.

The reason you’re seeing two separate “runs” there is that there are TF event files landing in separate subdirectories. That’s confusing. It’d be better if TB used a term other than “runs”. Alas, that’s the way they present the info.
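To make that concrete, here is a minimal sketch that mimics the enumeration. It is not TensorBoard's actual code. It walks a log directory and reports every subdirectory holding a file with "tfevents" in its name, which is how TF event files are conventionally named. The function names and the example layout (a .guild subdir plus a runs/example subdir) are assumptions for illustration.

```python
import os
import tempfile


def find_tb_runs(logdir):
    """Return directories under `logdir`, as relative paths, that contain event files.

    Each returned directory corresponds to one entry in TensorBoard's
    "runs" list.
    """
    runs = []
    for root, _dirs, files in os.walk(logdir):
        if any("tfevents" in name for name in files):
            runs.append(os.path.relpath(root, logdir))
    return sorted(runs)


def demo():
    """Build a fake run directory with two event-file locations and list them."""
    with tempfile.TemporaryDirectory() as run_dir:
        for sub in (".guild", os.path.join("runs", "example")):
            os.makedirs(os.path.join(run_dir, sub))
            # Empty placeholder file standing in for a real event file.
            open(os.path.join(run_dir, sub, "events.out.tfevents.0"), "w").close()
        return find_tb_runs(run_dir)
```

One run directory with event files in two places yields two entries, which matches the duplicate seen in the UI.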

The somewhat odd appearance of <run dir>/.guild in this list is because Guild writes its TF event logs in a subdirectory .guild. This is to avoid possible collisions with any files that your script runs. As @teracamo says, it’s sometimes helpful to poke around this directory to see what Guild saves with your run. You don’t need to worry too much about it, but it’s there in plain view if you ever need to understand something in more detail.

Okay, to the problem at hand! As I see it there are three options to address the point of confusion:

1. Don’t worry about it. It’s okay to have multiple subdirs in TB associated with a run. Look at the two runs and in your head say, “one run, one run” until the problem resolves itself.

2. If you’re already logging scalars, you don’t really need Guild’s output scalar support. You can disable it for a single operation this way:

op:
  output-scalars: no

Alternatively, use the operation-defaults model attr in a “full format” Guild file:

- operation-defaults:
    output-scalars: no
  operations:
    op: ...

This eliminates the .guild entry in the runs list in TB. That’s simple enough but you’ll be responsible for logging scalars. Since you’re doing that already, I think this is a pretty good option.

3. Write your summaries to <run dir>/.guild. This will consolidate the summaries you write with the summaries that Guild writes. I personally don’t like this option and would discourage it. I think your TF event files should land wherever you want them: the root of the run dir or a subdirectory. That’s a pretty standard convention in TensorFlow land and writing to .guild is a bit unconventional.

I was hoping for an output-scalars attribute that let you write to a different directory but 0.7.0 doesn’t support this. I think that’d be a good option 4. Something like this:

- op:
  output-scalars:
    summary-path: .  # This is hypothetical - Guild 0.7.0 does not support this

A long-winded response, but hopefully it gives you some useful background.
