Strange out-of-memory behavior on Guild with XGBoost

Hi all, I’m using Guild to manage and tune XGBoost models for a binary classification problem. My dataset is around 2 MB, with roughly 20K rows and 15 features. My XGBClassifier has around 100 estimators, the max depth is 6, and the tree method is gpu_hist.

When I run my program in VS Code it executes with no problems. When I run the same program in the command line with Guild, it also finishes without throwing any error. But when I look at the run in guild view or guild runs, it says that the run exited with an error status 3221226505.

Online sources say this generic Python error is some form of out-of-memory error. However, this can’t be the case, as I monitored RAM and VRAM usage while the program was running and both stayed very low.
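For anyone looking up this status: large Windows exit codes like this one are usually NTSTATUS values, which are documented in hexadecimal. A minimal Python sketch of the conversion (the variable names are only illustrative):

```python
# Exit status shown by guild runs / guild view for the failed run.
exit_status = 3221226505

# NTSTATUS codes are listed in hexadecimal, so format the unsigned
# 32-bit value as 8 hex digits before searching for it.
hex_status = f"0x{exit_status:08X}"
print(hex_status)  # 0xC0000409
```

0xC0000409 appears in Microsoft's NTSTATUS list as STATUS_STACK_BUFFER_OVERRUN. That points toward a crash in native code rather than ordinary memory exhaustion, though the code alone does not identify which component failed.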

When I switch the tree method to plain hist (CPU-only training) and re-run the program, guild view shows the run as completed.

May I know if this is a bug? My GPU is a Quadro T1000, my XGBoost version is 1.6.2, and here is part of the guild check output:

guild_version:             0.8.1
python_version:            3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]
platform:                  Windows 10 AMD64
psutil_version:            5.9.2
tensorboard_version:       2.10.1
cuda_version:              11.7
nvidia_smi_version:        516.69
latest_guild_version:      0.8.1

Thank you!

These resource-related issues can be tough to track down.

Do you see this exit code consistently (i.e. every time) when you run with Guild? These OOM issues are sometimes intermittent.

A solid way to test for this is to reboot your system (sorry, I know that’s a common trope!) and then to immediately run the offending command in Guild. This would provide an environment that might have more available memory.

If Guild consistently triggers this, you can disable some potential causes using a few environment variables:

set GUILD_PLUGINS=
set LOG_INIT_SKIP=1
set NO_RUN_OUTPUT=1
guild run ...

With these set, Guild just passes the command through to Python and does no additional work. If there’s a memory-related issue with Guild, that should fix it. If that works, please let me know and we can track the issue down further.

Hi @garrett, thanks for the suggestions!

Yes, the issue occurs every time I use the gpu_hist tree method and run my experiment with Guild. The exit code doesn’t appear in the console output, but it does appear in guild view and guild runs. (If I run the experiment with a hyperparameter optimizer, the error does get logged to the console.) Neither rebooting and running the command straight away nor setting the environment variables made a difference.

I put a print statement at the end of my program and ran it in VS Code as well as Guild. In VS Code, the program exits almost immediately after the print. In Guild, there is a significant delay of around 4 seconds after the print before the program exits.

Perhaps the issue is in Guild’s cleanup? Or is there an environment variable I can set to disable that too?

You may be running into the same behavior when you run from the command line or from the VS Code terminal. Exit codes aren’t printed anywhere so you could be getting a resource-related exit without knowing it, regardless of whether Guild is used.

If you’re running from the Windows command prompt (I recommend this over a VS Code terminal, at least for this test), run this immediately after running your code with Python directly (don’t use Guild):

echo %errorlevel%

If you get a non-zero exit code here, the problem is in your code: something is exhausting a resource when you use gpu_hist.
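The same check can be sketched in Python with subprocess, which exposes the child process's exit status as returncode. This is only an illustration: the helper name exit_code_of is made up here, and the demo command is a stand-in child that exits with status 3 rather than the real training script.

```python
import subprocess
import sys


def exit_code_of(cmd):
    """Run cmd to completion and return its exit status."""
    return subprocess.run(cmd).returncode


# Stand-in child process that exits with status 3. To test real code, replace
# this with [sys.executable, "your_script.py"] (hypothetical file name).
demo_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
code = exit_code_of(demo_cmd)

# Also show the status as unsigned 32-bit hex, the form Windows codes are listed in.
print(f"exit code: {code} (0x{code & 0xFFFFFFFF:08X})")
```

Anything other than 0 from the real script would mean the failure happens without Guild as well.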

I just tried this. echo %errorlevel% returns 0 after I run the Python file directly in the command line.

How are you running the operation with Guild? Are you running a Python script directly or are you using a Guild file with a main spec for the operation?

You can remove Guild’s Python support from the mix by running your operation this way:

test:
  exec: python .guild/sourcecode/test.py
  output-scalars: off
  plugins: off

This is as close to a pass-through to your code as possible.

Hi @garrett, sorry for the late response!!

I am using a Guild file in the operation-only format. My XGBoost operation looks like this (I only included 2 of my flags; I have 12 in total):

XGBClassifier:
  description: Train XGBClassifier on dataset
  main: XGBClassifier
  flags:
    N_ESTIMATORS:
      type: int
    MAX_DEPTH:
      type: int
  output-scalars:
    AVG_ACC: 'AVG_ACC: (\value)'
    AVG_SENS: 'AVG_SENS: (\value)'
    AVG_SPEC: 'AVG_SPEC: (\value)'

Then I run the experiment with

guild run XGBClassifier

After trying your suggestion, the run now shows up as ‘completed’ in guild runs. When I restore the output-scalars and remove plugins: off, there is still no issue. So the problem likely stems from main: XGBClassifier and the large number of flags.

However, I also noticed that the console started to output my print statements in a very jerky manner once I implemented your suggestion. The text still prints, but it intermittently appears many lines at a time instead of coming out line by line.
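Bursty output like this is consistent with Python block-buffering stdout when it writes to a pipe instead of a console, which is probably what happens when the script is started through exec. That is an assumption, not something confirmed for this setup. A minimal sketch of forcing line-by-line output from inside the script (the log helper is illustrative):

```python
import time


def log(msg):
    # flush=True writes each line out immediately, even when stdout is a pipe.
    print(msg, flush=True)


for step in range(3):
    log(f"step {step}")
    time.sleep(0.1)  # stand-in for real work between messages
```

Running the interpreter with python -u, or setting PYTHONUNBUFFERED=1, should have the same effect without code changes.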

Thank you!

Hi @garrett, sorry for the trouble. May I know if there’s any update on this? Thank you very much!