Pre-processing not updating

What is the usual run time for pre-processing on a job? It’s been running for close to 3 hours, and the UI isn’t updating with the run data (none pending, running, cancelled, failed or succeeded)…

This is most likely the famous bug that we are currently dealing with and @tyler is working on. Can you give us a hint about the project and job id?

It’s in our Aurecon company group. Are you able to access these projects?

Hey @compdesignernz,

Apologies for this being such a persistent issue! It seems you have exactly the kind of workloads that bring this up right now.

For reference, I’ve been trying to replicate this issue on our test cluster so that I can be sure we’re fixing the right thing. I was able to replicate it yesterday and found a few nested problems to fix.

  1. We have been using our own fork of our workflow engine which is based on an older release. The maintainers have suggested that the newer versions have fixes for a memory leak issue that large numbers of concurrent runs (~200) can cause in one of the monitoring components.
  2. We have a custom status-monitoring service that reads the output of the workflow engine to update the web UI in real time. The memory leak issue can cause this service to be forcibly restarted into a state where these updates aren’t emitted.
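
As a rough illustration of point 2 (this is not the actual service, and every name in it is hypothetical), a monitor can avoid getting stuck after a forced restart by re-reading each run's current status on startup instead of relying only on events it may have missed:

```python
from dataclasses import dataclass, field


@dataclass
class StatusMonitor:
    """Hypothetical monitor that mirrors engine run statuses into a UI store."""

    ui_store: dict  # stands in for whatever the web UI reads from
    emitted: list = field(default_factory=list)

    def publish(self, run_id: str, status: str) -> None:
        # Emit only when the status changed, so replaying a snapshot is harmless.
        if self.ui_store.get(run_id) != status:
            self.ui_store[run_id] = status
            self.emitted.append((run_id, status))

    def reconcile(self, engine_snapshot: dict) -> None:
        # On startup, walk every run the engine knows about and publish
        # its current status, covering anything missed while down.
        for run_id, status in engine_snapshot.items():
            self.publish(run_id, status)


def simulate_restart() -> dict:
    ui_store = {}
    engine = {"run-1": "running", "run-2": "running"}

    first = StatusMonitor(ui_store)
    first.reconcile(engine)

    # The monitor is killed here; runs finish while nothing is emitting.
    engine["run-1"] = "succeeded"
    engine["run-2"] = "failed"

    # A fresh monitor reconciles on startup and catches the UI up.
    second = StatusMonitor(ui_store)
    second.reconcile(engine)
    return ui_store
```

Calling `simulate_restart()` returns the UI store after the second monitor has caught up. Publishing only on change is what makes it safe to reconcile on every start.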

Today, I merged a change to fix some cascading issues that these cause in our event stream. I just finished testing and it does partially resolve the missing status updates. Tomorrow, I will update our workflow engine (and try some other ideas I have to prevent this) and use the test cases I set up to confirm that this is indeed the issue and that it is ameliorated.

Until then, we can manually push your jobs to the finished state once they are done. (FYI, they aren’t done right now, I can see them running on the command line :sunglasses: )

My apologies again about this! We’re working as fast as we can to make sure we resolve the underlying issues so that it doesn’t crop up again in the future.

Hey, @tyler :wave:

Thanks for the update and the speedy effort! I will be running more jobs that will be in the 200+ run category, so a fix is very much appreciated! My last sim eventually ran, although the simulation took a very long time (22 hours or longer)… Looks like we are really putting the tool through its paces :laughing:!

That’s all good! I totally appreciate the effort you all are putting into addressing the issues that are popping up.

Also, I have put another issue in the forum where loading a job into the load runs component results in GH freezing :confused:

Yep! Your jobs have been a great test case for us! Thanks again for using real-world workloads. It’s the only way we can actually build a useful platform.

Good news, I think I’ve finally resolved all of the underlying issues that were causing run updates to go missing. I’ve tested with multiple large parametric jobs running concurrently and have seen them all render their statuses correctly.

I actually duplicated your job to run another test of a large, long-running job tonight. Provided that works as expected, I’ll push the updates out tomorrow and we should be good to go.

Until you break it again :wink:

No worries! It saves me time with running months’ worth of simulations, so even with breakages, it is still faster :laughing: ! That’s great news! I have maybe 10 more jobs to run, so we will see how it goes!

Haha, I’ll do my best to make sure the tool is resilient :wink: . With the updates, I am very impressed at how quick things are!

Update:
Tonight I merged the changes that I had been working on to resolve this. My results on our test cluster are promising. I was able to replicate one of your jobs, @compdesignernz, which had been an issue before, and have it generate all of the necessary updates to show up as complete in the browser app, multiple times, with other large parametric jobs running concurrently.

Would you be able to run some more large parametric jobs and let me know here how it goes? If there are any lingering issues, I’m sure you’ll find them :smile_cat:
