Restarting multiple runs

emgong · July 26, 2020, 8:28am

Hi,
As far as I can tell there is no way to restart multiple runs. My scenario: I started a batch run (which staged a few hundred runs), then the machine terminated before most of the runs had completed, and I wanted to re-run all staged runs (and the non-batch run that was terminated).

Using guild run --restart $(guild select --staged) only runs the latest staged run.

I have worked around it with a bash script that loops over all staged run IDs and calls guild run --restart $short_id separately for each run.
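
A minimal sketch of that kind of per-run workaround, assuming the staged run IDs have already been collected by some other means and are passed in as arguments. The function name restart_each and the GUILD_CMD variable are hypothetical and not part of Guild. The --yes option is there so that each restart does not stop at a confirmation prompt.

```shell
# GUILD_CMD is a hypothetical override so the loop can be dry-run,
# for example with GUILD_CMD=echo. It defaults to the real CLI.
GUILD_CMD="${GUILD_CMD:-guild}"

# Restart each run ID given as an argument, one at a time.
restart_each() {
    for run_id in "$@"; do
        # --yes skips the confirmation prompt so the loop runs unattended.
        "$GUILD_CMD" run --yes --restart "$run_id" \
            || echo "restart failed: $run_id" >&2
    done
}

# Example with placeholder IDs:
# restart_each abc123 def456 ghi789
```

Runs are restarted sequentially here, and a failed restart is reported on stderr without stopping the loop.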

Is there a simpler way of doing that? If not, I would raise this as a feature request. Perhaps also add a more specific feature to “resume” a batch run (one that would automatically restart only the staged and terminated runs that originated from this batch run).

garrett · July 26, 2020, 3:14pm

You can use a queue.

If you have a bunch of staged runs that you want to run in a single pass (sequentially) use:

guild run queue run-once=yes

If you don’t specify run-once, it’s no big deal, but the queue keeps running and looking for staged runs.

Regarding using a batch to manage a restart, now looking at how --restart behaves with the default batch operation, it’s wrong I think. In fact I think it’s a regression. It used to work by actually restarting the trials it had previously staged. Now it generates new runs. The current behavior for batch restart is actually what --proto is for.

So I think we can call that a bug, or at least a design oversight.

If you could open an issue for that it would be very helpful. This would be in the “resume batch” spirit that you mention. (Note to self: I think run will need another option to tell the batch whether or not to restart completed trials, in addition to staged, terminated, or errors. This could alternatively be a batch op flag. I can’t imagine it would be used often.)

But in the meantime, the queue will run the staged runs.

garrett · July 26, 2020, 9:33pm

I want to capture another issue here. Your instinct to pass multiple run IDs to run --restart is, I think, quite right. The run command is currently designed to restart only one run, but this could be enhanced:

guild run --restart abc123 def456 ghi789

Then this would work:

guild run --restart `guild select --staged :`

Note the colon arg : there. By default select returns one run ID. We’d need to support returning many but I think this would have to be explicit using a range selector. Note this could also be spelled as 1:.

garrett · July 26, 2020, 9:35pm

I opened an issue for the restart topic:

emgong · July 27, 2020, 12:27pm

Thanks a lot.

I was thinking the queue might help, but the documentation was not clear enough about how to use it, and it also stops only after running all staged runs. Also, I wasn’t sure what would happen if I started a queue while another batch run was running - would they compete over who runs the staged runs that the batch run had created?

Anyway, now that I understand the queue option better, I think I will use it until this feature is supported. Thanks.

Also, I completely agree that guild select should have an option to return more than a single run ID. Actually, I was very surprised that it returned only a single run ID when I ran it - I would have expected it to return all runs matching the given criteria by default.

BTW. I didn’t completely follow the relation to the --proto option.

garrett · July 27, 2020, 2:24pm

Re queues, I was going to write you here to say that queues will not start trials that are managed by a batch run — but I just tested this and that’s not the case. In 0.7.0 queues in fact do compete with batches that stage trials. This is a bug that will be fixed in 0.7.1!

You can use --stage-trials to avoid these race conditions.

In looking over the Queues docs it looks like the info is covered but as you say it’s not clear how to use them. I wonder if an end-to-end example (e.g. a How To guide) would make things clearer.

The --proto is similar to --restart but it tells Guild to start a new run rather than restart an existing run. The run specified with --proto is used as the prototype (or “template”) for the new run. So the operation, flags, source code, etc. are all used by default to start the new run. You can change flags and use --force-sourcecode to use the current working code rather than the prototype’s.

The most common use case for --restart is when a run terminates early. You want to literally restart the run in place. Restartable runs need to check for interim save-points and re-initialize their state as needed on restart (Guild doesn’t do anything like that automatically). So the driver for using --restart is that you have some saved state in the run directory that you want to use.
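
As a rough illustration of what "check for interim save-points" can mean, here is a minimal sketch of a restartable job. Everything in it is hypothetical: the run_steps function, the checkpoint file and its one-number format are assumptions for illustration only, and the checkpoint path is assumed to point somewhere inside the run directory. Guild itself does not provide any of this.

```shell
# Hypothetical restartable job: counts up to a total number of steps and
# saves the last finished step to a checkpoint file after each one.
run_steps() {
    checkpoint="$1"   # path of the save-point file (assumed name and format)
    total="$2"        # total number of steps to complete
    step=0

    # On restart, pick up from the saved step instead of starting at zero.
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
        echo "resuming from step $step"
    fi

    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        echo "$step" > "$checkpoint"
    done
    echo "finished at step $step"
}

# Example: run_steps checkpoint.txt 100
```

If the job is killed partway through and then started again with the same checkpoint path, it continues from the last recorded step rather than repeating finished work.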

I think --proto is less commonly used. I use it when I want to start a run using the flag values from a previous run but I’m too lazy to look up the values and re-type them. In that case I’d use --force-sourcecode to make sure that my current source code is used. Another common case is to tweak some flag values but make sure the source code doesn’t change. In that case you do not use --force-sourcecode.
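
The two --proto cases above could be sketched as follows. The wrapper function names, the GUILD_CMD variable, the run ID abc123 and the flag lr=0.01 are all hypothetical placeholders. Only --proto, --force-sourcecode and the ability to change flags come from the description above, and --yes is added to skip the confirmation prompt.

```shell
# GUILD_CMD is a hypothetical override for dry runs (e.g. GUILD_CMD=echo).
GUILD_CMD="${GUILD_CMD:-guild}"

# Case 1: reuse the prototype's flag values, but run the current working code.
rerun_with_current_code() {
    "$GUILD_CMD" run --yes --proto "$1" --force-sourcecode
}

# Case 2: keep the prototype's source code and override some flag values.
rerun_with_new_flags() {
    proto="$1"
    shift
    "$GUILD_CMD" run --yes --proto "$proto" "$@"
}

# Examples with placeholder values:
# rerun_with_current_code abc123
# rerun_with_new_flags abc123 lr=0.01
```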

garrett · July 27, 2020, 3:50pm

@emgong FYI I created an issue resolution doc to recreate the problem and to also verify the fix when it lands.

emgong · July 28, 2020, 5:19am

Thanks. Regarding the Queue documentation, I think it lacks some details (e.g., the run-once option is not mentioned there or in the documentation of Run). And yes, I believe an end-to-end how-to guide would help a lot - for queues and for all other features.

The description you gave regarding --proto is very clear and I believe I understand it completely now. In my personal opinion, adding --force-sourcecode is not a good design choice. I think it makes the usage of this feature ambiguous (with respect to its purpose), more complicated, and dangerous (you could easily pass flags that are no longer relevant, or omit flags that are important to pass with the new code). I completely understand why it could save time, though; I just think being explicit and clear is more important in this case (especially when you have the configuration files that already save you a lot of time).

garrett · July 28, 2020, 10:42am

Thank you! I have a note to fill in the missing pieces of queues and provide a guide.

The use of --force-sourcecode is indeed a bad idea if your use case is to test new flags on the same source code. And that is a common case. But it’s also common to test a code change against the same flags. I wouldn’t call anything dangerous as long as you track it accurately. If flags change that much, you probably don’t want to use --proto in the first place.

emgong · July 28, 2020, 11:44am

Yes, I agree. I probably exaggerated when I called it “dangerous”.
Thanks.
