Using regex to correct OCR output identical chars (capital I vs. 1, etc.)

I have a trained OCR model that reads specific fonts. Some of these fonts have identical-looking characters, such as 1 and capital I, so when wordlist prediction fails I occasionally get an I where a 1 should be, or a 1 where an I should be.

In my case, I know that there should never be...

a 1 within an alphabetic string; e.g., 1NDEPENDENCE DAY
an I within an integer; e.g., 45I OZ
an I next to certain special characters like %, +, and -; e.g., I% OFF, TEMP: -I DEGREES
a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM
consecutive I's; e.g., II A.M.

This is my first attempt, which addresses some of these cases, but I'm sure there's a more efficient way to do this. Maybe looping over a list of regex patterns with re.sub?

import re

ocr_output = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."

while True:
    x = re.search(r"[\d+-]I", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + '1' + ocr_output[x.start() + 2:]
    else:
        break

while True:
    x = re.search(r"I[\d%-]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + '1' + ocr_output[x.start() + 1:]
    else:
        break

while True:
    x = re.search("[A-Z]1", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + 'I' + ocr_output[x.start() + 2:]
    else:
        break

while True:
    x = re.search("1[A-Z]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + 'I' + ocr_output[x.start() + 1:]
    else:
        break
    
print(ocr_output)

>>> TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, II A.M.

What more elegant solutions can you think of to correct my OCR output for these cases? I'm working in Python. Thank you!

This is what I came up with:

import re

def preprocess_ocr_output(text: str) -> str:
    output = text
    output = re.sub(r"1(?![\s%])(?=\w+)", "I", output)
    output = re.sub(r"(?<=\w)(?<![\s+\-])1", "I", output)
    output = re.sub(r"I(?!\s)(?=[\d%])", "1", output)
    output = re.sub(r"(?<=[+\-\d])(?<!\s)I", "1", output)
    return output
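
As a quick check, here is a sketch that runs those four substitutions on the sample string from the question. The function body is copied unchanged; the names `sample` and `result` are added here only for illustration.

```python
import re

def preprocess_ocr_output(text: str) -> str:
    # Same four substitutions as above, copied unchanged.
    output = text
    output = re.sub(r"1(?![\s%])(?=\w+)", "I", output)
    output = re.sub(r"(?<=\w)(?<![\s+\-])1", "I", output)
    output = re.sub(r"I(?!\s)(?=[\d%])", "1", output)
    output = re.sub(r"(?<=[+\-\d])(?<!\s)I", "1", output)
    return output

# Sample string taken from the question.
sample = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."
result = preprocess_ocr_output(sample)
print(result)
# TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, I107 OZ, INDEPENDENCE DAY, II A.M.
```

On this input the solitary I and "II A.M." are left alone, and "II07" comes out as "I107": when the third substitution scans the leading I, the next character is still an I rather than a digit, so only the second I is replaced.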

a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM

I don't think you can turn that standalone "I" into "1" without causing problems elsewhere, since a regex alone can't tell it apart from a legitimate standalone I.
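
To make that concrete, here is a rough sketch of what a standalone-I rule might look like and where it goes wrong. The pattern and the names `STANDALONE_I` and `fix_standalone_i` are hypothetical, not something from the question.

```python
import re

# Hypothetical rule: an "I" with no letter or digit on either side becomes "1".
STANDALONE_I = re.compile(r"(?<![A-Za-z0-9])I(?![A-Za-z0-9])")

def fix_standalone_i(text: str) -> str:
    return STANDALONE_I.sub("1", text)

print(fix_standalone_i("TIME: I TO 5 PM"))  # TIME: 1 TO 5 PM
print(fix_standalone_i("I LOVE NY"))        # 1 LOVE NY  (the pronoun is lost)
```

If the OCR'd text can never contain the pronoun or a Roman numeral I, a rule like this may be acceptable; otherwise it needs context that a regex alone doesn't have.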

I would use these kinds of regexes for the general cases. Don't forget to precompile the regexes for performance reasons.

import re

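# I preceded by %, I, a digit, - or +, or followed by %, I or a digit -> 1
I_regex = re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])")
# 1 preceded by an uppercase letter, or followed by any letter -> I
One_regex = re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])")

def preprocess(text):
    output = I_regex.sub('1', text)
    output = One_regex.sub('I', output)
    return output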

Output:

>>> preprocess('TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M.')
'TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.'
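
The looping idea from the question can be sketched by keeping (pattern, replacement) pairs in a list and applying them in order with `sub`. This is only a sketch under that assumption: `RULES` and `apply_rules` are illustrative names, and the two patterns are the same ones used above.

```python
import re

# Illustrative rule table: (compiled pattern, replacement), applied in order.
RULES = [
    # I preceded by %, I, a digit, - or +, or followed by %, I or a digit -> 1
    (re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])"), "1"),
    # 1 preceded by an uppercase letter, or followed by any letter -> I
    (re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])"), "I"),
]

def apply_rules(text, rules=RULES):
    # Each rule sees the output of the previous one, so order matters.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

sample = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."
print(apply_rules(sample))
# TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.
```

Adding another case then means appending one more pair to the list rather than writing another loop.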
