Using regex to correct look-alike characters in OCR output (capital I vs. 1, etc.)
I have a trained OCR model that reads specific fonts. In some of these fonts, certain characters look identical, such as 1 and capital I, so when word-list prediction fails I occasionally get an I where a 1 should be, or a 1 where an I should be.
In my case, I know that there should never be...
- a 1 within a string; e.g.,
1NDEPENDENCE DAY
- an I within an integer; e.g.,
45I OZ
- an I next to certain special characters like %, +, and -; e.g.,
I% OFF
TEMP: -I DEGREES
- a solitary I (these will all be 1's); e.g.,
TIME: I TO 5 PM
- consecutive I's; e.g.,
II A.M.
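One way to pin these rules down is as input/expected pairs, so any candidate fix can be checked against them. This is only a minimal sketch: `CASES` and `failures` are names made up here, and the expected strings simply apply the rules above to the examples.

```python
# Each pair is (raw OCR text, text after applying the rules listed above).
CASES = [
    ("1NDEPENDENCE DAY", "INDEPENDENCE DAY"),
    ("45I OZ", "451 OZ"),
    ("I% OFF", "1% OFF"),
    ("TEMP: -I DEGREES", "TEMP: -1 DEGREES"),
    ("TIME: I TO 5 PM", "TIME: 1 TO 5 PM"),
    ("II A.M.", "11 A.M."),
]


def failures(fix):
    """Return (raw, got, expected) for every case that `fix` gets wrong."""
    return [
        (raw, fix(raw), expected)
        for raw, expected in CASES
        if fix(raw) != expected
    ]
```

`failures(my_fix)` returns an empty list once `my_fix` handles every example.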
This is my first attempt, which addresses some of these cases, but I'm sure there's a more efficient way to do this. Maybe looping over a list of regular expressions with re.sub?
import re
ocr_output = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."
# I preceded by a digit, + or -  ->  1
while True:
    x = re.search(r"[\d+-]I", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + '1' + ocr_output[x.start() + 2:]
    else:
        break
# I followed by a digit, % or -  ->  1
while True:
    x = re.search(r"I[\d%-]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + '1' + ocr_output[x.start() + 1:]
    else:
        break
# 1 preceded by a capital letter  ->  I
while True:
    x = re.search(r"[A-Z]1", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()+1] + 'I' + ocr_output[x.start() + 2:]
    else:
        break
# 1 followed by a capital letter  ->  I
while True:
    x = re.search(r"1[A-Z]", ocr_output)
    if x:
        ocr_output = ocr_output[:x.start()] + 'I' + ocr_output[x.start() + 1:]
    else:
        break
print(ocr_output)
>>> TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, II A.M.
What more elegant solutions can you think of to correct my OCR output for these cases? I'm working in Python. Thank you!
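One possible shape for the "loop over a list of regexes with re.sub" idea, as a sketch rather than a tuned solution. It assumes the same four rules as the while-loops above, rewritten with lookarounds so that only the I or 1 itself is replaced; `SUBSTITUTIONS` and `correct` are illustrative names. Each rule is repeated until the text stops changing, because a single `re.sub` pass does not rescan characters it has just replaced (the `II07` case needs two passes).

```python
import re

# (pattern, replacement) pairs mirroring the four while-loops in the question.
SUBSTITUTIONS = [
    (re.compile(r"(?<=[\d+-])I"), "1"),  # I preceded by a digit, + or -
    (re.compile(r"I(?=[\d%-])"), "1"),   # I followed by a digit, % or -
    (re.compile(r"(?<=[A-Z])1"), "I"),   # 1 preceded by a capital letter
    (re.compile(r"1(?=[A-Z])"), "I"),    # 1 followed by a capital letter
]


def correct(text):
    """Apply each rule in order, repeating it until the text is stable."""
    for pattern, replacement in SUBSTITUTIONS:
        previous = None
        while previous != text:
            previous = text
            text = pattern.sub(replacement, text)
    return text


sample = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."
print(correct(sample))
```

Like the original loops, this leaves the solitary I and the `II A.M.` case untouched; adding a rule is a matter of appending another pair to the list.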
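This is what I came up with:
import re

def preprocess_ocr_output(text: str) -> str:
    output = text
    # 1 followed by a word character (and not by whitespace or %) -> I
    output = re.sub(r"1(?![\s%])(?=\w+)", "I", output)
    # 1 preceded by a word character -> I
    output = re.sub(r"(?<=\w)(?<![\s+\-])1", "I", output)
    # I followed by a digit or % -> 1
    output = re.sub(r"I(?!\s)(?=[\d%])", "1", output)
    # I preceded by +, - or a digit -> 1
    output = re.sub(r"(?<=[+\-\d])(?<!\s)I", "1", output)
    return output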
"a solitary I--these will all be 1's; e.g., TIME: I TO 5 PM"
I don't think you can convert that free-standing "I" to "1" without causing problems elsewhere.
I would use this kind of regex for the general cases. Remember to precompile the regexes for performance.
import re
I_regex = re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])")
One_regex = re.compile(r"(?<=[A-Z])1|1(?=[A-Z])|1(?=[a-z])")
def preprocess(text):
    output = I_regex.sub('1', text)
    output = One_regex.sub('I', output)
    return output
Output:
>>> preprocess('TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M.')
'TIME: I TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.'
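If the solitary-I rule from the question really holds for your data, it could be added as one more pattern between the two substitutions. This is a hedged sketch: `RULES` and `apply_rules` are names invented here, and the `\bI\b` pattern assumes a free-standing I is never a legitimate word, so the pronoun I or a contraction like I'M would be turned into 1 as well.

```python
import re

# Ordered (pattern, replacement) pairs; order matters because later rules
# see the output of earlier ones.
RULES = [
    # I next to a digit, %, + or -, or next to another I  ->  1
    (re.compile(r"(?<=[%I0-9\-+])I|I(?=[%I0-9])"), "1"),
    # Assumption: a solitary I (word boundary on both sides)  ->  1
    (re.compile(r"\bI\b"), "1"),
    # 1 next to a letter  ->  I
    (re.compile(r"(?<=[A-Za-z])1|1(?=[A-Za-z])"), "I"),
]


def apply_rules(text: str) -> str:
    """Run every rule once, in order, over the whole text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "TIME: I TO 5 PM, I% OFF, TEMP: -I DEGREES, II07 OZ, 1NDEPENDENCE DAY, II A.M."
print(apply_rules(sample))
# TIME: 1 TO 5 PM, 1% OFF, TEMP: -1 DEGREES, 1107 OZ, INDEPENDENCE DAY, 11 A.M.
```

Keeping the rules in a list makes it easy to drop the solitary-I entry again if it turns out to misfire on real text.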