1 failed out of 144 runs. Why, and is it possible to re-calculate?

adaaaaam · April 20, 2023, 6:15pm

Hey everyone!

I’ve been loving the convenience of Pollination cloud computing for handling intensive simulations! However, we recently encountered an issue during our first full test with custom-energy-sim 0.3.17. Of the 144 simulations we ran, 143 were successful, but 1 failed.

The logs from the failed run appear unusual, as the simulation seems to have halted unexpectedly without any error messages (I’ve attached the logs after the warm-up below).

Initially, I thought the issue could be due to the simulation exceeding the limits for a single run, but the successful runs were essentially using the same model, with only minor parameter adjustments.

Interestingly, I discovered that some successful runs had also encountered the same issue, but their debug pages indicated that new simulations were started and eventually succeeded.

Could anyone provide some insights on this issue? Additionally, is there a way to re-run the failed simulation within the study instead of creating a new one? The post-processing can get quite messy when trying to combine results from different studies.

Any assistance would be greatly appreciated. Thanks in advance!

Starting Simulation at 01/01/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=01/31/2006
Continuing Simulation at 01/31/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=03/02/2006
Continuing Simulation at 03/02/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=04/01/2006
Continuing Simulation at 04/01/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=05/01/2006
Continuing Simulation at 05/01/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=05/31/2006
Continuing Simulation at 05/31/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=06/30/2006
Continuing Simulation at 06/30/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=07/30/2006
Continuing Simulation at 07/30/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=08/29/2006
Continuing Simulation at 08/29/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=09/28/2006
Continuing Simulation at 09/28/2006 for RUN PERIOD 1
Updating Shadowing Calculations, Start Date=10/28/2006

mostapha · April 20, 2023, 6:39pm

Hi @adaaaaam!

This is one of those comments that make it all worth it! Thank you! I’m glad that it is helping you with your studies.

Your observations are correct, except that the two kinds of runs were canceled for different reasons. We have default retries for cases where a Pod gets deleted or a network issue causes an error, but not all of these failures are marked as errors in the same way. I have noticed this issue and have been working with the Pipekit team on a solution that retries the failed cases, except for those caused by memory issues, which will fail again even if we re-run them.
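The retry policy described here can be illustrated with a minimal sketch. This is not Pollination's or Pipekit's actual implementation: the names `FailureKind`, `RunFailure`, `run_with_retries`, and `make_flaky_run`, as well as the default of three retries, are all hypothetical and only show the idea of retrying transient failures (a deleted Pod, a network issue) while giving up immediately on memory failures.

```python
from enum import Enum
from typing import Callable, Iterable


class FailureKind(Enum):
    """Hypothetical failure categories, based on the cases named in the thread."""
    POD_DELETED = "pod_deleted"
    NETWORK = "network"
    OUT_OF_MEMORY = "out_of_memory"


# Assumption: only transient failures are worth retrying.
# A memory failure would fail again on an identical re-run.
RETRYABLE = {FailureKind.POD_DELETED, FailureKind.NETWORK}


class RunFailure(Exception):
    """Raised by a run that did not complete; carries the failure category."""

    def __init__(self, kind: FailureKind):
        super().__init__(kind.value)
        self.kind = kind


def run_with_retries(run: Callable[[], str], max_retries: int = 3) -> str:
    """Call run() and retry it only when the failure is transient."""
    attempt = 0
    while True:
        try:
            return run()
        except RunFailure as failure:
            attempt += 1
            # Re-raise non-retryable failures, or when retries are exhausted.
            if failure.kind not in RETRYABLE or attempt > max_retries:
                raise


def make_flaky_run(failures: Iterable[FailureKind]) -> Callable[[], str]:
    """Build a fake run() that raises the given failures in order, then succeeds."""
    pending = list(failures)

    def run() -> str:
        if pending:
            raise RunFailure(pending.pop(0))
        return "succeeded"

    return run


if __name__ == "__main__":
    # Two transient failures are retried and the run eventually succeeds.
    print(run_with_retries(make_flaky_run([FailureKind.POD_DELETED, FailureKind.NETWORK])))

    # A memory failure is not retried.
    try:
        run_with_retries(make_flaky_run([FailureKind.OUT_OF_MEMORY]))
    except RunFailure as failure:
        print("not retried:", failure.kind.value)
```

In this sketch, a failure that is not classified at all would fall outside `RETRYABLE` and never be retried, which is one way a single run could end up failed while others with similar interruptions recovered.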

Fortunately, there is an option. We haven’t exposed it but I think we should.

I re-ran the failed run for you under the same study. Can you check and confirm that everything looks good on your end? Hopefully, once the retry fix is in and the option to manually retry failed cases is exposed, you won’t need my help on this issue anymore!

adaaaaam · April 20, 2023, 6:46pm

Hi @mostapha!

Thank you sooooo much for your quick reply and fast fix!
Yes, I saw the changes and all the runs are completed now.

adaaaaam · April 20, 2023, 7:45pm

Hi @mostapha, I’m afraid I have to bother you again…

I tried to open the study page to check the details of the run you just re-computed, but I get the loading sign all the time. The workspace looks fine, though, and I can access the files there.

Besides, I was also trying to get the data from Pollination cloud, but when I get to the previously failed run I still get: 1. Error calling ListRuns: Internal Server Error

I tried logging in again on the web page and in the GH component, as well as rebooting my computer, but it didn’t work… I guess it could be something on the server?

mostapha · April 20, 2023, 7:51pm

It is on me! Let me see what I have done wrong. I’ll keep you posted.

mostapha · April 23, 2023, 3:10pm

This issue was resolved via messages.