urllib IncompleteRead() error: can I solve it by just re-requesting?
I am running a script that scrapes several hundred pages on a site, but recently I have been running into IncompleteRead() errors. My understanding from looking on Stack Overflow is that they can happen for any number of unknown reasons.
From searching around, I believe the error is raised at random while reading the response returned by urlopen():
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
3.5.2.3
2.1.3.15
2.5.1.72
1.5.1.2
6.1.1.9
3.2.2.27
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
chunk_left = self._get_chunk_left()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
raise IncompleteRead(b'')
IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-20-82f1876d3006>", line 5, in <module>
html = urlopen(url).read()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
return self._readall_chunked()
File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
raise IncompleteRead(b''.join(value))
IncompleteRead: IncompleteRead(1772944 bytes read)
The error happens at random, in the sense that it is not always the same URL that causes it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 caused this specific one.
Some solutions seem to introduce a try clause, but within the except they store the partial data (I think). Why is that the case? Why not just resubmit the request? And if resubmitting is an option, how would I re-run the request, given that doing so manually normally seems to solve the issue? Beyond this I have no idea how to fix the problem.
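For reference, this is roughly what the partial-data pattern from those answers looks like; a minimal sketch only, assuming the url variable from the loop above (whether the truncated body is still usable is another question):

import http.client
from urllib.request import urlopen

try:
    html = urlopen(url).read()
except http.client.IncompleteRead as e:
    # The exception carries whatever bytes were received before the
    # connection dropped; those answers keep this instead of retrying.
    html = e.partial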
As per Serge's answer below, catching the error in a try/except and retrying seems to be the way:
import http.client
import time
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    sleep = 0
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise  # give up after 4 attempts
            time.sleep(sleep)  # wait before retrying, a little longer each time
            sleep += 5
    soup = BeautifulSoup(html, 'html.parser')
Solution 1:
The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.
As you have said, this can happen for numerous causes, and the occurrence is random. So:
- you cannot predict when or for which file it will happen
- you cannot prevent it from happening
The best you can do is to catch the error and retry, after an optional delay.
For example:
import http.client
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for ec in unq:
    print(ec)
    url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                  headers={'User-Agent': 'Mozilla/5.0'})
    for i in range(4):
        try:
            html = urlopen(url).read()
            break
        except http.client.IncompleteRead:
            if i == 3:
                raise  # give up after 4 attempts
            # optionally add a delay here
    soup = BeautifulSoup(html, 'html.parser')
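If the fetch is needed in several places, the same retry logic can be pulled into a small helper. This is only a sketch building on the answer above; the name fetch_html and the retries/delay values are illustrative, not part of the original answer:

import http.client
import time
from urllib.request import Request, urlopen

def fetch_html(page_url, retries=3, delay=5):
    """Fetch page_url, retrying on IncompleteRead with a growing pause."""
    req = Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(retries + 1):
        try:
            return urlopen(req).read()
        except http.client.IncompleteRead:
            if attempt == retries:
                raise  # out of attempts, surface the error
            time.sleep(delay * (attempt + 1))  # 5 s, 10 s, 15 s, ...

It would then be called inside the loop as html = fetch_html("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec), keeping the scraping loop itself free of retry details.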