Tensorboard logging twice + is slow

I have an operation, let's call it a, that is kind of the "base operation" that I can run a number of ways with different flags. Then I have another operation b that has one step that runs operation a with the proper flags.

What's strange to me is that I seem to get multiple copies of everything I logged with TensorBoard. Additionally, guild view is pretty slow to open my runs, and viewing in TensorBoard from there is much slower (takes 10+ s). I saw someone had an issue with symlinks, but I don't think that's the issue here since I don't have any set up. I feel like these issues are probably linked, considering I wasn't having this problem before. Even if I run just one configuration, I still end up with duplicates.

Not sure how to go about fixing this; I probably did something wrong.

The duplication of results in TensorBoard is by design: steps (pipeline) ops inherit the scalar results of their child runs. To avoid this, use a filter to limit the runs that you're viewing in TensorBoard. E.g. to view just the ap results (which include imputer), run:

guild tensorboard --operation ap

Re slow performance in View, you're right, this is not likely an issue with symlinks (in that case you would have seen poor performance with guild tensorboard as well, though this has since been fixed).

Guild View is missing some optimizations, which causes it to slow down as you view more and more runs. In this case, you can work around the problem by limiting the runs you show when launching View. Use filters, e.g.:
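A minimal example, assuming guild view accepts the same --operation filter shown for guild tensorboard above:

guild view --operation ap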

But to address the issue with TensorBoard performance, I recommend running guild tensorboard directly rather than launching it via View.

View is due for a series of enhancements, including performance optimization, but until then these are the workarounds I recommend.


I face the same issue but I am unsure how to resolve it. I have an operation called start which I run when I start an experiment. For every run, there are two entries in TensorBoard. How can I filter out one of them?

guild tensorboard --operation start doesn’t work.

Do you mind sharing some details on where the second entry in TensorBoard is coming from?

[Screenshot 2020-11-25 at 13.03.53: TensorBoard showing two entries for the run]

Is it possible that your code is also writing TensorBoard logs to a directory under the working directory, or to a directory you requested in the resources?

This is the way I log results right now:

from torch.utils.tensorboard import SummaryWriter

self.writer = SummaryWriter()
self.writer.add_scalar('Loss/train', loss_train / (batch_idx + 1), global_step)

Does this create the duplicate entry in TensorBoard?

From here it says that by default it writes to ./runs/, so I guess this is most likely what's happening.
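A quick way to confirm (the timestamped name below is just PyTorch's default logdir naming):

from torch.utils.tensorboard import SummaryWriter

w = SummaryWriter()   # no logdir given, so PyTorch writes under ./runs/<datetime>_<hostname>/
print(w.log_dir)      # e.g. runs/Nov25_13-03-53_myhost
w.close()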

What I recommend is to create a static directory on your machine and write all runs to that directory (of course you need to clean it yourself from time to time, otherwise it could bleed your hard disk dry in no time).

And then you change your line to:

self.writer = SummaryWriter('/some/static/path/')

Or better still, use an environment variable to specify where to write it, so you can track your experiments better:

# shell: set the variable for this command only (no semicolon, or it won't be passed to guild run)
TENSORBOARD_LOGDIR=/some/dir/ guild run XXX

# Python: read the directory from the environment
import os
self.writer = SummaryWriter(os.environ['TENSORBOARD_LOGDIR'])

Ok, so since this static logging directory is not inside the run folder, it won't be shown in TensorBoard, right?

Correct, as long as it's not under Guild's home directory (usually ~/.guild).

Ok, so the only purpose of logging to the static dir is to trigger guildai?
This seems a bit hacky.

Actually, the purpose is to NOT trigger Guild's TensorBoard, so you don't show duplicate items in it.

Generally, there are things you want to be logged by Guild: you write those into the flags and scalars in your .yml file so that they show up in TensorBoard. And there are things that are more complicated and that Guild doesn't support yet, like network parameter histograms; those you write into another TensorBoard logdir that you manage yourself.
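For the first part, a minimal guild.yml sketch (the operation name, flag, and scalar pattern here are hypothetical):

train:
  main: train
  flags:
    lr: 0.001
  output-scalars:
    loss: 'loss: (\value)'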

If you want to write your own data into the same TensorBoard event location that Guild manages, you can do so by specifying the logdir to be the .guild folder in the run sub-directory. You can do that by:

self.writer = SummaryWriter('./.guild/')
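For example, to log the parameter histograms mentioned above (the model here is a hypothetical stand-in):

import torch
from torch.utils.tensorboard import SummaryWriter

model = torch.nn.Linear(4, 2)        # hypothetical stand-in model
writer = SummaryWriter('./.guild/')  # write into the run's .guild dir
for step in range(3):
    for name, param in model.named_parameters():
        writer.add_histogram(name, param, global_step=step)
writer.close()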

You can also understand more about Guild's run file hierarchy by browsing the run sub-directory; try guild ls and explore it. Note that under the run sub-directory, there's a hidden directory .guild that holds the source code, the TensorBoard events, the environment variables, etc.
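For example (the --all flag to include Guild-managed files is my recollection of guild ls --help; check your version):

guild ls        # list files in the latest run
guild ls --all  # also show Guild files such as the .guild directory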