500 Internal Server Error when checking study status

serpoag · July 12, 2023, 3:43pm

Hi,
I am using the following function to check the progress of a job executed on the cloud.

def check_study_status(study: Job):
    """"""
    status = study.status.status

    while True:
        status_info = study.status
        print('\t# ------------------ #')
        print(f'\t# pending runs: {status_info.runs_pending}')
        print(f'\t# running runs: {status_info.runs_running}')
        print(f'\t# failed runs: {status_info.runs_failed}')
        print(f'\t# completed runs: {status_info.runs_completed}')
        if status in [
            JobStatusEnum.pre_processing, JobStatusEnum.running, JobStatusEnum.created,
            JobStatusEnum.unknown
        ]:
            time.sleep(15)
            study.refresh()
            status = status_info.status
        else:
            # study is finished
            time.sleep(2)
            break

It was working fine before, but today it raises the following error:

Traceback (most recent call last):

  File ~\AppData\Local\anaconda3\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\arr18sep\desktop\pollination\marl_cloud.py:860
    check_study_status(study=study)

  File ~\Desktop\pollination\herp\pollination_interact.py:136 in check_study_status
    study.refresh()

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:95 in refresh
    self._fetch_runs()

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\interactors.py:88 in _fetch_runs
    self._runs = self.run_api.get_runs(self.owner, self.project, self.id)

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:22 in get_runs
    return self._run_results_request(owner, project, job_id)

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\runs.py:10 in _run_results_request
    res = self.client.get(

  File ~\AppData\Local\anaconda3\lib\site-packages\pollination_streamlit\api\client.py:82 in get
    res.raise_for_status()

  File ~\AppData\Local\anaconda3\lib\site-packages\requests\models.py:1021 in raise_for_status
    raise HTTPError(http_error_msg, response=self)

HTTPError: 500 Server Error: Internal Server Error for url: https://api.pollination.cloud/projects/centipede-llc/seventh_tst/results?job_id=55ece200-dbed-466f-a889-0051d4feaadb&page=1

I have seen this error before, but only rarely. Now it does not even let me download the first of the series of jobs I submit. The models are uploaded, but after the first pass through the function it apparently loses communication with the server and does not receive any further response.

mostapha · July 12, 2023, 3:49pm

Hi @serpoag,

I suspect this was happening for the same reason that we had an issue with the license pool. I just checked the URL and I can get the response.

In any case, I still owe you the improved code from this topic, which catches the HTTP error and retries a few times before failing:

I’ll update the code in the other thread shortly. Meanwhile, can you try again to run your code and see if everything works as expected? Thanks!

serpoag · July 12, 2023, 4:43pm

Thanks for your help!
I have tried several times, even after receiving your response. The issue persists. The first batch of models is uploaded, but then the check study status function runs only once. The output freezes for more than a minute, and the same error is returned again. I have tried different IDEs, refreshing the API token and changing the project folder. Nothing seems to work.
This is the output just before the HTTP error:

Uploading model: My_model-2.hbjson
Uploading model: My_model-3.hbjson
Uploading model: My_model-6.hbjson
Uploading model: My_model-7.hbjson
https://app.pollination.cloud/centipede-llc/projects/seventh_tst/jobs/28c12e68-f53f-4173-8094-c995cccd39be
	# ------------------ #
	# pending runs: 0
	# running runs: 4
	# failed runs: 0
	# completed runs: 0

It freezes there for more than a minute, and finally the 500 error is returned. This is definitely a communication error, as I can access my account and see the models uploaded and the jobs completed.

mostapha · July 12, 2023, 4:49pm

That is strange! I can see that the studies are scheduled successfully and everything looks fine from the web interface that uses the same API call.

Let me check and see what is going on! Sorry for the inconvenience.

mostapha · July 12, 2023, 9:22pm

@serpoag - I updated the code under the other topic and tried running it a couple of times. Everything works as expected. Can you give it a try and let me know if it also works for you? Thanks.

serpoag · July 13, 2023, 9:03am

Running perfectly as usual. Thanks! I tried both the old code and the new one, and both work fine. I guess it was a temporary issue on the server side. Did you find any possible cause?

mostapha · July 13, 2023, 1:40pm

Excellent! Yes. It was a misconfiguration on our end that affected a few internal calls. They were timing out, so you would get a 500 response. This issue should not happen again. The other problem you faced before was caused by infrastructure unavailability, which is out of our control. We have workflows in place that recover from those incidents quickly, but recovery can take a few seconds. The new check that I put in the code should keep your script running until the automated fix kicks in.
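
For reference, a check of this kind could look roughly like the sketch below. It is not the code posted in the other topic, and `call_with_retries` and all of its parameters are hypothetical names, not part of pollination_streamlit. The idea is only to wrap a call that may fail transiently, wait, and try again a limited number of times before giving up.

```python
import time


def call_with_retries(func, max_attempts=3, delay=5.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on one of the listed exceptions, wait and try again.

    Hypothetical helper for illustration only. It re-raises the last
    error once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            print(f'\t# attempt {attempt} failed ({error}); '
                  f'retrying in {delay}s')
            sleep(delay)
```

In the status loop from the first post, the bare `study.refresh()` call could then become `call_with_retries(study.refresh, exceptions=(HTTPError,))`, with `from requests.exceptions import HTTPError` at the top of the script, since the traceback shows that requests raises this exception. Note that this retries on any HTTP error status; checking `error.response.status_code` for a 5xx value before retrying would avoid repeating calls that fail for client-side reasons.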