Guild compare / view / tensorboard hangs

I have about 15 runs, each of which performs about 60000 steps and logs a loss for each step. Listing them with guild runs works fine, but when I try to extract the best run using compare, or view the results using view / tensorboard, it loads for a long time before I can actually see the information.

What could be the reason for this? Am I logging too much per run?

EDIT: I realised that @garrett has said elsewhere that the view is due for an overhaul. Perhaps that will fix this problem.

Does this long delay occur again and again, or only after the initial view for a run?

Guild caches the log info it reads from the TF summary files. These files can take quite a while to read. But once they’re read, the data is cached and Guild should be very fast.

Views in TensorBoard will continue to take a long time because TensorBoard always reads the data anew.

Unfortunately this is the nature of the TF summary files.

If any of your TF event files are over 1G, you’ll notice this delay, and it can take up to several minutes for files over 100G.
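
To check whether any of your event files are in that size range, something like the following might help. It is only a minimal sketch: it assumes event files have "tfevents" in their file names (the usual TensorFlow naming), and find_large_event_files and the directory path in the usage comment are hypothetical names for illustration, not part of Guild.

```python
import os


def find_large_event_files(root, threshold_bytes=1024 ** 3):
    """Return (path, size_in_bytes) pairs for event files under root whose
    size is at least threshold_bytes, largest first."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: TF event files contain "tfevents" in their name.
            if "tfevents" not in name:
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                results.append((path, size))
    # Sort so the largest files are listed first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Example usage (replace the placeholder path with your own runs directory):
# for path, size in find_large_event_files("path/to/runs"):
#     print(f"{size / 1024 ** 2:.1f} MiB  {path}")
```

The default threshold is 1 GiB to match the figure above; pass a smaller threshold_bytes to see how the sizes are distributed across runs.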

The only way around this that I’m aware of is to log less frequently. But 60000 steps is not that much, so there might be another issue here.
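
As a rough illustration of logging less frequently, here is a sketch that writes the loss only every N steps instead of at every step. The train function, the log_every parameter, and the log_scalar callback are all hypothetical stand-ins for your own training loop and summary writer, and the loss is a placeholder value.

```python
def train(num_steps, log_every, log_scalar):
    """Run a dummy training loop, calling log_scalar only on every
    log_every-th step and on the final step."""
    for step in range(1, num_steps + 1):
        loss = 1.0 / step  # placeholder for a real training step
        if step % log_every == 0 or step == num_steps:
            log_scalar("loss", loss, step)


# Collect the logged points in a list instead of writing a summary file.
logged = []
train(60000, 100, lambda tag, value, step: logged.append((tag, value, step)))
print(len(logged))  # 600 points instead of 60000
```

With log_every=100, a 60000-step run produces 600 scalar points, which is usually still plenty to see the shape of the loss curve.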

To try:

  • Does Guild Compare take just as long the second time you run it?
  • Does TensorBoard, when run manually (e.g. run tensorboard --logdir <some_run_dir>), take a long time to show scalars?

Thanks for the response @garrett, will try this!