500 Internal Server Error when checking study status

I am using the following function to check the progress of a job executed on the cloud.

import time

from pollination_streamlit.interactors import Job
from queenbee.job.job import JobStatusEnum  # import path assumed


def check_study_status(study: Job):
    status = study.status.status

    while True:
        status_info = study.status
        print('\t# ------------------ #')
        print(f'\t# pending runs: {status_info.runs_pending}')
        print(f'\t# running runs: {status_info.runs_running}')
        print(f'\t# failed runs: {status_info.runs_failed}')
        print(f'\t# completed runs: {status_info.runs_completed}')
        if status in [
            JobStatusEnum.pre_processing, JobStatusEnum.running, JobStatusEnum.created,
        ]:
            # still in progress: wait (interval is an assumption), then refresh
            time.sleep(15)
            study.refresh()
            status = status_info.status
        else:
            # study is finished
            break

It worked fine previously, but today it raises the following error:

Traceback (most recent call last):

  File ~\AppData\Local\anaconda3\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\arr18sep\desktop\pollination\marl_cloud.py:860

  File ~\Desktop\pollination\herp\pollination_interact.py:136 in check_study_status

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:95 in refresh

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:88 in _fetch_runs
    self._runs = self.run_api.get_runs(self.owner, self.project, self.id)

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:22 in get_runs
    return self._run_results_request(owner, project, job_id)

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:10 in _run_results_request
    res = self.client.get(

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\client.py:82 in get

  File ~\AppData\Local\anaconda3\lib\site-packages\requests\models.py:1021 in raise_for_status
    raise HTTPError(http_error_msg, response=self)

HTTPError: 500 Server Error: Internal Server Error for url: https://api.pollination.cloud/projects/centipede-llc/seventh_tst/results?job_id=55ece200-dbed-466f-a889-0051d4feaadb&page=1

I have seen this error before, but only rarely. Now I cannot even download the first of a series of jobs that I submit. The models are uploaded, but after the first pass through the function the script apparently loses communication with the server and receives no further response.

Hi @serpoag,

I suspect this was happening for the same reason that we had an issue with the license pool. I just checked the URL and I can get the response.

In any case, I still owe you improved code from the other topic that catches the HTTP error and retries a few times before failing.

I’ll update the code in the other thread shortly. Meanwhile, can you try again to run your code and see if everything works as expected? Thanks!
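As a rough idea of what "catch the HTTP error and retry a few times before failing" could look like, here is a minimal sketch. It is not the code from the other thread: `call_with_retries`, its parameters, and the default wait time are hypothetical names and values chosen for illustration, and the helper is not part of `pollination_streamlit`.

```python
import time


def call_with_retries(func, retry_on, max_retries=3, wait_seconds=10.0):
    """Call func(); if it raises an exception of type retry_on, wait and retry.

    Hypothetical helper for illustration only. Once max_retries retries
    have failed, the last exception is re-raised unchanged.
    """
    failures = 0
    while True:
        try:
            return func()
        except retry_on as error:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            print(f'\t# request failed ({error}); retry {failures} of {max_retries}')
            time.sleep(wait_seconds)
```

Since the traceback shows the `HTTPError` from `requests` being raised inside `refresh`, the call in the polling loop could then be written as `call_with_retries(study.refresh, retry_on=requests.HTTPError)` after an `import requests`, so that a short server-side outage no longer stops the whole script.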

Thanks for your help!
I have tried several times, including after receiving your response, and the issue persists. The first batch of models is uploaded, but then the check_study_status function prints the run counts only once. The output freezes for more than a minute, and then the same error is returned. I have tried different IDEs, refreshing the API token, and changing the project folder. Nothing seems to work.
This is the output just before the HTTP error:

Uploading model: My_model-2.hbjson
Uploading model: My_model-3.hbjson
Uploading model: My_model-6.hbjson
Uploading model: My_model-7.hbjson
	# ------------------ #
	# pending runs: 0
	# running runs: 4
	# failed runs: 0
	# completed runs: 0

It freezes there for more than a minute, and then the 500 error is returned. This is definitely a communication error, as I can access my account and see the models uploaded and the jobs completed.

That is strange! I can see that the studies are scheduled successfully and everything looks fine from the web interface that uses the same API call.

Let me check and see what is going on! Sorry for the inconvenience.

@serpoag - I updated the code under the other topic and tried running it a couple of times. Everything works as expected. Can you give it a try and let me know if it also works for you? Thanks.

Running perfectly as usual. Thanks! I tried both the old code and the new one, and both work fine. I guess it was a temporary issue on the server side. Did you find any possible cause?

Excellent! Yes. It was a misconfiguration on our end that affected a few internal calls. Those calls were timing out, which is why you got a 500 response. This issue should not happen again. The other problem you faced before was caused by infrastructure unavailability, which is out of our control. We have workflows in place that recover from those incidents quickly, but recovery can take a few seconds. The new check that I put in the code should keep your script running until the automated fix kicks in.
