Remote stop not working

After running guild runs stop X -r server, the processes are still running on the remote and GPU memory has not been released, even though Guild reports the run as terminated.

I think it may be PyTorch’s data loader worker processes that are still running, but I’m not sure. I think I saw something about this before, but I couldn’t find it here or on GitHub. Has anyone else experienced this issue?

I wonder if Python’s multiprocessing module is in play here. 0.7.3 reworked parts of the stop function, and the SIGTERM might not be propagating to the child processes.

I’ll see if I can replicate this with a simple multiprocessing example. This wouldn’t need PyTorch.
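
Something along these lines might work as a starting point. It is only a rough sketch of the suspected behavior, not Guild's actual stop code: a stand-in "run" process starts one worker via multiprocessing, then SIGTERM is sent to the run process alone and we check whether the worker is still around. All names here (worker, run_process, stop_parent_only) are made up for illustration, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os
import signal
import time

# Use "fork" explicitly so the sketch does not depend on the platform's
# default start method (assumption: POSIX only).
CTX = mp.get_context("fork")


def worker():
    # Hypothetical stand-in for a data loader worker: loops until killed.
    while True:
        time.sleep(0.1)


def run_process(pid_queue):
    # Hypothetical stand-in for the run itself. Restore the default SIGTERM
    # action in case the launching process installed its own handler.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    child = CTX.Process(target=worker, daemon=True)
    child.start()
    pid_queue.put(child.pid)  # report the worker's PID to the launcher
    child.join()


def pid_alive(pid):
    # Signal 0 only checks that the PID exists; nothing is delivered.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def stop_parent_only():
    # Start the run, SIGTERM only the run process, then check the worker.
    pid_queue = CTX.Queue()
    run = CTX.Process(target=run_process, args=(pid_queue,))
    run.start()
    worker_pid = pid_queue.get(timeout=10)

    os.kill(run.pid, signal.SIGTERM)
    run.join(timeout=10)
    time.sleep(0.5)  # give any cleanup a chance to happen

    survived = pid_alive(worker_pid)
    if survived:
        # Clean up the orphaned worker so the sketch leaves nothing behind.
        os.kill(worker_pid, signal.SIGKILL)
    return run.exitcode, survived


if __name__ == "__main__":
    exitcode, survived = stop_parent_only()
    print("run exit code:", exitcode)
    print("worker survived SIGTERM to parent:", survived)
```

If the worker is reported as surviving, that would match what is described above: the run looks terminated while a child keeps running. With the default SIGTERM action the parent dies without running multiprocessing's exit cleanup, so even a daemon worker is not terminated. Whether this is what actually happens with the PyTorch data loader still needs to be confirmed.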

Ref to GitHub issue: 0.7.3 stop not working with PyTorch data loader (Issue #281, guildai/guildai)