Timeout error

Hi @garrett,

I can’t figure out why I sometimes get a timeout error when I run experiments. Here is the command I run:

guild run train tb_volatility_lookback=range[30:300:10]

It hangs after some time, and when I press Ctrl+C to abort it, the error below shows up.

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
INFO: [numexpr.utils] NumExpr defaulting to 8 threads.
    "__main__", mod_spec)
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\batch_main.py", line 38, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\batch_main.py", line 26, in main
    batch_util.handle_trials(batch_run, trials)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\batch_util.py", line 54, in handle_trials
    _run_trials(batch_run, trials)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\batch_util.py", line 79, in _run_trials
    _start_trial_run(run, stage)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\batch_util.py", line 117, in _start_trial_run
    run_impl.run(restart=run.id, stage=stage)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1940, in run
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1017, in main
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1101, in _dispatch_op
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1286, in _dispatch_op_cmd
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1354, in _confirm_and_run
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1544, in _run
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1575, in _run_local
    _run_op(op, S.args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\commands\run_impl.py", line 1683, in _run_op
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op.py", line 160, in run
    exit_status = _run(run, op, quiet, stop_after, extra_env)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op.py", line 195, in _run
    exit_status = _op_wait_for_proc(op, proc, run, quiet, stop_after)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op.py", line 230, in _op_wait_for_proc
    return _op_watch_proc(op, proc, run, quiet, stop_after)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op.py", line 238, in _op_watch_proc
    return _proc_wait(proc, stop_after)
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op.py", line 259, in __exit__
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op_util_legacy.py", line 254, in wait_and_close
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op_util_legacy.py", line 219, in close
    lock = self._acquire_output_lock()
  File "C:\ProgramData\Anaconda3\lib\site-packages\guild\op_util_legacy.py", line 232, in _acquire_output_lock
    raise RuntimeError("timeout")
RuntimeError: timeout

Is this project publicly available so that I could try to reproduce this?

This is a deadlock, which is usually hard to track down because it’s hard to recreate. I’d normally ask for a simplified version, but chances are good that the issue goes away as you simplify the project.
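For illustration only: the last frames of the traceback show a lock acquisition that gives up and raises RuntimeError("timeout"). That general pattern can be sketched in plain Python as below. This is not Guild's actual code; the helper name `acquire_or_raise` and the 0.1 second timeout are made up for the example.

```python
import threading


def acquire_or_raise(lock, timeout):
    """Try to take the lock; raise if it is still held after `timeout` seconds.

    Hypothetical helper, written only to illustrate the pattern.
    """
    if not lock.acquire(timeout=timeout):
        raise RuntimeError("timeout")
    return lock


lock = threading.Lock()
lock.acquire()  # simulate a holder that never releases the lock

try:
    acquire_or_raise(lock, timeout=0.1)
    result = "acquired"
except RuntimeError as e:
    result = str(e)

print(result)
```

If the holder never releases the lock, every later attempt to acquire it ends the same way, which matches the hang followed by a timeout error described above.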

If you’re not able to share the project, we’ll need to come up with a strategy for debugging this.

It should be easily reproducible. Here is my module: https://github.com/MislavSag/trademl/tree/master/trademl/modeling.

The following command should be run from the project root:
guild run --yes --max-trials 150 train labeling_technique=[trend_scanning] tb_volatility_scaler=[1,1.5,2] tb_triplebar_num_days=[10,20,30,50] tb_triplebar_min_ret=[0.04,0.05,0.06] tb_volatility_lookback=range[30:300:10] sample_weights_type=[returns,time_decay]

EDIT: The data is the problem! I only have the data locally. I plan to put it on SQL…

So unfortunately, it is not possible to reproduce right now.

I created a fork of this project and made some changes so I could run up to the point where the train script needs the h5 data.

If you can recreate this problem with sample data that you can send me, I’ll try to reproduce it on my end.

As a workaround for this problem, try setting NO_RUN_OUTPUT=1 when you run the operation.

NO_RUN_OUTPUT=1 guild run train tb_volatility_lookback=range[30:300:10]

It worked when I used a lower number of flags. Maybe there was a problem with one of the grid searches (one flag used the range function), but it works now. I will try to debug it by adding new flags and seeing at which one it stops working.

Threading problems are notorious for suddenly appearing and suddenly disappearing. They are highly nondeterministic and therefore hard to track down and fix.

My bet is that you’ll run into this again.

I think the best course would be to produce some sanitized test data (nothing private) that, along with the full set of flags, reproduces the deadlock. With that, hopefully, I can reproduce it on a system. With that I can fix it. Short of that, it’ll be tough.

Otherwise, I suggest setting NO_RUN_OUTPUT=1 in your environment to make sure Guild never runs the code in question. With that, you can safely run the script without risk of it deadlocking. You can write your output to logs as needed.

@garrett, when I run with the NO_RUN_OUTPUT=1 option:
NO_RUN_OUTPUT=1 guild run --yes --max-trials 32 random_forest_sklearnopt:train num_threads=4 tb_volatility_lookback=[50,200] ts_look_forward_window=[600,1200,2400,4800] sample_weights_type=[returns,time_decay]

I get the error:

NO_RUN_OUTPUT=1 : The term 'NO_RUN_OUTPUT=1' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ NO_RUN_OUTPUT=1 guild run --yes --max-trials 32 random_forest_sklearn ...
+ ~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (NO_RUN_OUTPUT=1:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

This looks like PowerShell. Consult the docs for setting env vars in that shell. This doc from Microsoft seems to apply.
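The `VAR=value command` prefix is a Unix shell convention; in PowerShell, environment variables are set with the `$env:NAME = "value"` syntax before running the command. A shell-independent alternative, sketched here under assumptions, is to launch Guild from a small Python script that adds NO_RUN_OUTPUT=1 to the child process environment. The helper names `build_env` and `run_guild` are hypothetical, and the sketch assumes the `guild` executable is on PATH.

```python
import os
import subprocess


def build_env(base=None):
    """Return a copy of the environment with NO_RUN_OUTPUT set to "1".

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["NO_RUN_OUTPUT"] = "1"
    return env


def run_guild(args, base_env=None):
    """Run `guild run <args>` with NO_RUN_OUTPUT=1 and return its exit code.

    Assumes the `guild` executable can be found on PATH.
    """
    return subprocess.call(["guild", "run"] + list(args), env=build_env(base_env))


# Example call (commented out so the sketch does not start a run on import):
# run_guild(["train", "tb_volatility_lookback=range[30:300:10]"])
```

Passing the flags as a list avoids shell quoting differences between PowerShell, cmd, and Unix shells.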

I set it like:

It worked fine on the first try. I will see whether any problems come up in the future.
