Guild runs command is slow

Alessandro · February 16, 2024, 12:12pm

Hi there,

Unlike the other topics about slow performance, this one is about the CLI command guild runs.

The problem I frequently encounter is that anything related to guild runs, e.g. guild view or guild compare, is often slow and only sometimes fast.

For example, running guild runs stop <hash> takes multiple minutes, even though I only have ~2000 experiments in total.

Running strace reveals a wall of

openat(AT_FDCWD, "/path/to/.guild/runs/2f98245057a97fa68f714334a5927c32/.guild/attrs/started", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=17, ...}, AT_EMPTY_PATH) = 0
ioctl(3, TCGETS, 0x7ffc47fdc640)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "1704721789638391\n", 8192)     = 17
read(3, "", 8192)                       = 0
read(3, "", 8192)                       = 0
close(3)

Stracing guild runs reveals thousands of similar openat syscalls.
Going through the source code, I assume this is due to the call to _all_runs in var.py.
This function goes through all experiments and determines each one's status by looking up files in .guild/attrs (in run.py).

If my understanding is correct, this leads to O(runs * files) runtime that is dominated by I/O time. So always reading the files instead of using the existing .guild/cache database makes this quite slow.
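
To illustrate the access pattern I suspect, here is a minimal sketch. It is not Guild's code: the directory layout is modeled on the strace output above, the helper names make_fake_runs and scan_runs are made up, and the attribute names other than started are placeholders. It only shows that reading one small file per attribute per run costs runs * files open/read/close cycles.

```python
import os
import tempfile
import time


def make_fake_runs(root, n_runs, attrs):
    """Create n_runs fake run dirs, each with one small file per attribute."""
    for i in range(n_runs):
        attrs_dir = os.path.join(root, f"{i:032x}", ".guild", "attrs")
        os.makedirs(attrs_dir)
        for name, value in attrs.items():
            with open(os.path.join(attrs_dir, name), "w") as f:
                f.write(f"{value}\n")


def scan_runs(root):
    """Read every attribute file of every run (one open/read/close each)."""
    runs = {}
    file_reads = 0
    for run_id in os.listdir(root):
        attrs_dir = os.path.join(root, run_id, ".guild", "attrs")
        run_attrs = {}
        for name in os.listdir(attrs_dir):
            with open(os.path.join(attrs_dir, name)) as f:
                run_attrs[name] = f.read().strip()
            file_reads += 1
        runs[run_id] = run_attrs
    return runs, file_reads


if __name__ == "__main__":
    # Attribute names besides "started" are placeholders for illustration.
    attrs = {"started": 1704721789638391, "attr_b": 0, "attr_c": 0}
    with tempfile.TemporaryDirectory() as root:
        make_fake_runs(root, 2000, attrs)
        t0 = time.perf_counter()
        runs, file_reads = scan_runs(root)
        elapsed = time.perf_counter() - t0
        print(f"{len(runs)} runs, {file_reads} file reads, {elapsed:.3f}s")
```

On a local disk with a warm page cache this will probably finish quickly. The cost should show up mainly when each open is expensive, e.g. on a network filesystem or with a cold cache, which might explain why it is sometimes slow and sometimes fast.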

Is this correct? Are there any plans to improve the performance?

Cheers
Alessandro
