Some runs taking much longer than others

Hi @mostapha and @antoinedao,

I’ve added you both to the Aurecon org and to the project in which I’m having the following problem when running a parameter study.

Running a single model locally takes just 3.7min. My study consists of 144 variations that should all take about the same time; I’m varying things like EPW file, SHGC, ventilation air change rate, i.e. not adding complexity to the simulation.

This is where it is currently at:
[screenshot of job status]

The 75 that succeeded all completed within the first 33min or so (give or take a minute or two), while the 48 that are currently running have been going for almost 2h, and there appear to be 21 that haven’t started yet.

Are you able to tell whether this is an issue with the way I set up the study, with Pollination, or with the display on the web app?

Hi @Max :wave:

I will be looking into this issue tomorrow morning. In the meantime, could you check the status of the job now to see whether more progress has been made or whether it is still stuck? Could you also give me the Job ID? This will allow me to check how it reported its progress and how it was executed on the backend.

Out of curiosity, how long did you expect this job to take, and what was your reasoning? It would be very helpful for us to understand why two hours felt too long for 144 parallel executions of a recipe that normally takes ~4min locally per execution.

We have introduced account-wide CPU limits (100 CPUs per account) during early access to ensure our infrastructure can cope with the competing resource needs of all users on the platform (and that we don’t go out of business because our free early access is too generous :sweat_smile:). This might be contributing to the slower-than-expected performance on the cloud. Regardless, I’ll do a manual check of your job tomorrow once you send me the Job ID :raised_hands:

Cheers,

Antoine

Hi @antoinedao

This is the current status:
[screenshot of job status]
Job ID: 6e4c0e99-9250-474d-b16d-e555a79fdfec

I sent off another one a couple of hours after I started that one; it has the same resolution but has since finished:
[screenshot of completed job]
Job ID: d5f7f3a8-386f-48cc-a7cf-c559eacb3b5c

Mostapha mentioned that there were resourcing limitations, but I wasn’t aware the limit was 100 CPUs per account. Without the limit, I would expect 100 simulations to take about as long as one local simulation (plus some overhead). Is that unrealistic?

With the limit, I still find it strange that it would take 8h, let alone 17h+. When I run an energy model locally, isn’t it run on a single core? In that case, when running 144 simulations on the cloud across 100 cores, shouldn’t it take around 8min plus overhead?
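For clarity, here’s the back-of-the-envelope arithmetic behind that figure (a rough sketch that assumes each run needs exactly one CPU and that the 100-CPU limit is the only constraint):

```python
import math

runs = 144             # variations in the study
cpu_limit = 100        # account-wide CPU limit
minutes_per_run = 3.7  # measured locally for a single model

# With one CPU per run, at most 100 runs execute at once,
# so the study should finish in ceil(144 / 100) = 2 waves.
waves = math.ceil(runs / cpu_limit)
print(f"expected wall time ~= {waves * minutes_per_run:.1f} min + overhead")
# prints: expected wall time ~= 7.4 min + overhead
```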


Hi @max :wave:,

Thanks for sending over these details. I am looking into this failure-to-update issue more closely this week and expect to get back to you next week with a timeline for a fix and a better explanation of the failure mode.

I will also look into the difference between the actual execution speed and your expected duration for these runs, and hopefully get back to you next week.

Thanks again for being patient with us :pray:

Cheers,

Antoine


All good, I understand that it is a work in progress. Thanks for the quick responses!


@antoinedao, just letting you know that I sent off another job yesterday and one today - both with the same number of runs (144) and only slight changes to the model that shouldn’t have affected the simulation time - and yet they both only took about 20min each (instead of 8h+ like the others I was asking about).


Hi @max :wave:,

I was going to update you on that one on Monday with a longer post explaining other scaling/performance issues we have noticed and are planning to fix in the coming month.

In this case, I think what happened is that I introduced a bug that overloaded our event management system. As a result, a bunch of events were handled with delays, which is why the job appeared to take longer than it actually did.

We’ll be writing a more detailed post on Monday about known issues and the fixes we plan to implement.

Cheers,

Antoine


Hi @antoinedao, unfortunately I’m getting a similar issue again. Here is a daylighting simulation (single run) that took about 20min to run locally but has been running for an hour on Pollination and still isn’t finished: ef6c1f1f-4173-5bdf-90e9-5cc74aba0969

Is that normal? This one has 14,000 points, but I set the ambient bounces to 0.

Hi @Max :wave:

Unfortunately, the issue of runs/jobs not being updated correctly is still persisting despite some fixes we implemented to resolve it. We have noticed that it is also affecting @compdesignernz and @patryk_wozniczka. We’re working hard to figure out which part of our pipeline is failing to update the status of runs/jobs; however, we do want to reassure you that the problem is not with the cloud execution or the compute elasticity. It lies in how we propagate the status of runs/jobs from the cloud executor to our backend runs service.

I am off on holiday this week, so @tyler will be looking into it and manually running updates whenever we notice a backlog of incomplete jobs. If this is still not resolved once I’m back, I think we will drop other development work to focus solely on this issue, as we appreciate it makes it difficult for you to do your work on our platform.

Thanks for being patient with us while we fix it.

Antoine


Hi @antoinedao and @tyler, just letting you know that the daylighting simulation with 0 ambient bounces that I mentioned earlier did complete after 1:15. The one with 4 ambient bounces that I sent off afterwards completed in a shorter time (1:10), though maybe that was because Tyler updated it manually, as you say.

No manual updates from me for this one!

I haven’t looked into this job specifically (I’m not sure I can see it on the platform if it’s private), but one dynamic that can affect simulation time is how “warm” the compute cluster is. We have it set to scale dynamically based on load, and when the load increases for the first time after scaling down, the job that triggers the scale-up can experience degraded performance while the cluster acquires more compute nodes.

You can test this by running the jobs in quick succession: run the one with 0 ambient bounces again immediately after the one with 4 ambient bounces. If the slower runtime is due to warm-up time, you should see its runtime decrease.
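If it helps, here is a minimal sketch of that comparison. `submit_and_wait` is a hypothetical placeholder for however you submit a job and wait for it to finish; it is not an actual Pollination SDK call:

```python
import time

def submit_and_wait(job_name: str) -> None:
    """Hypothetical placeholder: submit the named job and block until it
    finishes. Swap in your own submission workflow here."""
    time.sleep(1)  # stands in for the actual run duration

def timed_run(job_name: str) -> float:
    """Submit one job and report how long it took, in minutes."""
    start = time.monotonic()
    submit_and_wait(job_name)
    minutes = (time.monotonic() - start) / 60
    print(f"{job_name}: {minutes:.1f} min")
    return minutes

# Submit back to back: the first job pays any cluster warm-up cost,
# the second should land on an already-scaled cluster.
timed_run("daylight, 4 ambient bounces (cluster possibly cold)")
timed_run("daylight, 0 ambient bounces (cluster warm)")
```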

We are planning to have options to avoid this situation in the future.

As @antoinedao mentioned, we have also been experiencing an issue with the apparent runtime on Pollination, which is solely due to how we update the status rather than how long the actual simulations take to execute. I’ve started looking into it and we will update everyone when we have a solution.

Thanks again for your patience and let us know any other issues you have!


Hi @Max, I suggest you wait to test this until we do the next round of inspections and make sure the time reported in the Pollination UI correctly represents the compute time. Right now it is a mix of compute and wait time, which might make the comparison misleading.


@mostapha I’m experiencing a similar discrepancy in runtimes: one job takes four minutes while the rest take between 30 and 120 minutes, even though the models are all extremely similar (just fin rotation/depth changes). Is there any update on this? I would expect that if one run takes 4 minutes, running them in parallel should also take about 4 minutes plus the startup time to distribute all the runs (in this case 117).

Hi @archgame, I checked on our server and your jobs are finished. Some of them failed because of memory issues, which we can resolve by increasing the memory allowance. We have to inspect those.

That would be correct if you had access to an unlimited number of CPUs, but a beta-testing account is limited to 100 CPUs per user. In addition, each run itself uses several CPUs because it runs in parallel. That’s what adds to the overall time.
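As a rough illustration of how the limit stretches the wall-clock time (the 2 CPUs per run below is just an assumed number for the example; the real value depends on the recipe):

```python
import math

runs = 117           # runs in the job
cpu_limit = 100      # per-account CPU limit during beta
cpus_per_run = 2     # assumed for illustration; depends on the recipe
minutes_per_run = 4  # roughly the duration of a single run

concurrent_runs = cpu_limit // cpus_per_run   # 50 runs at a time
waves = math.ceil(runs / concurrent_runs)     # 3 waves of runs
print(f"~{waves * minutes_per_run} min of compute + scheduling/reporting overhead")
# prints: ~12 min of compute + scheduling/reporting overhead
```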

That said, it should not take as long as it does for you. This has been a recurring issue, and we have tried to fix it with adjustments, but the root cause is how we update the progress report and try to keep track of every change in progress. This reporting gets computationally expensive and generates thousands of events, which slows down the whole process. It needs some refactoring on our side, which we will do after we roll out the payment infrastructure and have an initial release of the apps.
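To give a sense of the general idea behind that refactoring (purely illustrative, not our actual implementation), the fix is to coalesce progress updates so that thousands of fine-grained changes produce only a handful of events:

```python
import time

class BatchedProgressReporter:
    """Illustrative only: emit at most one progress event per `interval`
    seconds instead of one event for every change."""

    def __init__(self, emit, interval: float = 5.0):
        self._emit = emit          # callback that actually sends the event
        self._interval = interval
        self._last_sent = float("-inf")
        self._pending = None

    def update(self, progress: float) -> None:
        """Record the latest progress; emit only if enough time has passed."""
        self._pending = progress
        if time.monotonic() - self._last_sent >= self._interval:
            self.flush()

    def flush(self) -> None:
        """Send the most recent pending progress value, if any."""
        if self._pending is not None:
            self._emit(self._pending)
            self._pending = None
            self._last_sent = time.monotonic()


# Thousands of fine-grained updates collapse into a handful of events.
reporter = BatchedProgressReporter(emit=lambda p: print(f"progress: {p:.0%}"))
for step in range(10_000):
    reporter.update(step / 10_000)
reporter.flush()
```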

I know that this is a major issue and a roadblock for running large studies with several steps, and we will get to it. I apologize for the inconvenience; we will make sure to resolve this before the beta-testing period ends.

I know it is not ideal, but until we fully resolve this issue you should run the simulations in smaller batches or run them locally. :neutral_face: I can’t wait for the day we mark this discussion as resolved. :white_check_mark:

I’m glad to report that we have finally merged a fully refactored progress-report implementation that should resolve the issue of some runs not being updated. We can finally mark this topic as resolved! :white_check_mark: And I can finally sleep better at night. :grinning: Thank you, @antoinedao! :raised_hands:
