How to detect if an image contains ASCII characters?
I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:
As you can see, in the bottom-right corner there is the text "MAY 18 2003", so it should be filtered out.
After some research, I came across Tesseract OCR. In Python I have the following code:
# Attempt 1
from PIL import Image
import pytesseract
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)
# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)
# Attempt 3
import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
# Note: char_whitelist is built above but never passed; the config below hard-codes digits only
text = pytesseract.image_to_string(img, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
None of them detected the string (they print only whitespace). How can I detect it?
You should prepare the image for OCR.
For example, for this image I would do the following:
- Convert it to a black-and-white image with a threshold that makes the text visible (for this image it is 130).
- Then invert the image (so that the text is black).
- Now try Tesseract OCR.
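The steps above could be sketched roughly as follows. This is only a sketch under assumptions: the function names `prepare_for_ocr` and `ocr_after_preparation` are made up for illustration, the text is assumed to be brighter than its surroundings, and the threshold and the inversion are folded into a single NumPy step rather than done separately.

```python
import numpy as np


def prepare_for_ocr(gray, threshold=130):
    """Turn a grayscale array into black text on a white background.

    Pixels brighter than `threshold` (assumed to be the text) become 0,
    everything else becomes 255, i.e. threshold and inversion in one step.
    """
    gray = np.asarray(gray)
    return np.where(gray > threshold, 0, 255).astype(np.uint8)


def ocr_after_preparation(path, threshold=130):
    """Load an image as grayscale, prepare it, and run Tesseract on it."""
    # Imported here so prepare_for_ocr works without OpenCV/Tesseract installed.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    return pytesseract.image_to_string(prepare_for_ocr(gray, threshold))
```

The value 130 is the one suggested for this particular image; other images will most likely need a different threshold.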
You can use EasyOCR instead of pytesseract, which directly gives this output:
Kay 10 2003
Since your goal is just to detect ASCII characters, the recognized characters don't need to be accurate; you only want to filter out the images that contain them.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import easyocr
path = ""
img = cv2.imread(path+"input.jpg")
# Now apply EasyOCR
reader = easyocr.Reader(['en'])
output = reader.readtext(img)
# Each result is (bounding box, text, confidence); print only the text
for bbox, text, confidence in output:
    print(text)
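Since the goal is filtering rather than accurate recognition, the EasyOCR results could be reduced to a yes/no decision along these lines. This is a sketch, not a tuned solution: `contains_text` and `image_contains_text` are hypothetical helper names, and the `min_confidence` and `min_chars` defaults are arbitrary assumptions that you would need to adjust for your dataset.

```python
def contains_text(results, min_confidence=0.3, min_chars=2):
    """Decide whether OCR results indicate text in the image.

    `results` is a list of (bounding box, text, confidence) entries, the
    shape returned by easyocr's Reader.readtext with default settings.
    """
    for _bbox, text, confidence in results:
        n_alnum = sum(ch.isalnum() for ch in text)
        if confidence >= min_confidence and n_alnum >= min_chars:
            return True
    return False


def image_contains_text(path, reader=None):
    """Run EasyOCR on the image at `path` and apply contains_text."""
    if reader is None:
        import easyocr  # imported lazily; creating a Reader loads the model
        reader = easyocr.Reader(['en'])
    return contains_text(reader.readtext(path))
```

When filtering a whole dataset, it is probably worth creating the `Reader` once and passing it in for every image instead of rebuilding it each time.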
You can use inRange thresholding.
The result will be:
If you set the psm mode to 6, the output will be:
<<
‘\
' MAY 18 2003
All the digits are captured correctly, but there are some unwanted characters. If we add an alphanumeric-only condition, the result will be:
['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']
First, I upsampled the image and then applied Tesseract OCR, because the date is otherwise too small to read.
Code:
import cv2
import pytesseract
from numpy import array
img = cv2.imread("result.png")  # Load the upsampled image
img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # Convert BGR to HSV for color masking
# Select pixels of any hue with saturation >= 103 and value >= 171
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
# Invert so the selected pixels are black on a white background
thr = 255 - cv2.bitwise_and(dlt, msk)
txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])  # Keep only alphanumeric characters
cv2.imshow("", thr)
cv2.waitKey(0)
You can set new values for the minimum and maximum ranges:
import numpy as np
min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)
You can also test with different psm parameters:
txt = pytesseract.image_to_string(thr, config='--psm 6')
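To compare several psm values in one go, something like the following might help. It is a sketch only: `try_psm_modes` and `alnum_chars` are hypothetical helpers, the default modes (6, 7, 11) are just a guess at useful candidates, and the `ocr` parameter is an assumption added so a different OCR function can be swapped in.

```python
def alnum_chars(text):
    """Return the alphanumeric characters of `text` as a list."""
    return [ch for ch in text if ch.isalnum()]


def try_psm_modes(image, psm_modes=(6, 7, 11), ocr=None):
    """Run OCR once per page segmentation mode.

    Returns a dict mapping each psm value to the alphanumeric characters
    found with that mode.
    """
    if ocr is None:
        import pytesseract  # imported lazily; needs Tesseract installed
        ocr = pytesseract.image_to_string
    return {psm: alnum_chars(ocr(image, config=f"--psm {psm}"))
            for psm in psm_modes}


# Example usage with the thresholded image from above:
# for psm, chars in try_psm_modes(thr).items():
#     print(psm, chars)
```

For the filtering use case, an image could be flagged as soon as any mode returns a non-empty list.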
For more, read: Improving the quality of the output