How to detect if an image contains ASCII characters?

I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:

[image: photo of a dog with a date stamp in the bottom-right corner]

As you can see, in the bottom-right corner there is the text "MAY 18 2003", so it should be filtered out.

After some research, I came across Tesseract OCR. In Python, I have the following code:

from PIL import Image
import pytesseract

# Attempt 1
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)

# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)

# Attempt 3
import string
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
text = pytesseract.image_to_string(img,lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)

None of them detected the string (each prints only whitespace). How can I detect it?


You should prepare the image for OCR first.

For example, for this image I would do the following:

  1. Convert it to a black-and-white image with a threshold that makes the text visible (for this image, 130).

  2. Then invert the image (so the text is black).

  3. Now try Tesseract OCR.


You can use EasyOCR instead of pytesseract, which directly gives this output:

Kay 10 2003

Since your goal is only to detect whether ASCII text is present, the misread characters don't matter: you just want to filter out the images that contain text.

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import cv2
import easyocr

path = ""
img = cv2.imread(path + "input.jpg")

# Create the EasyOCR reader for English
reader = easyocr.Reader(['en'])

# Each result is a (bounding box, text, confidence) tuple
output = reader.readtext(img)

for _, text, _ in output:
    print(text)
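
To turn that output into a keep/discard decision, something like the following could work. It is a rough sketch, not part of EasyOCR: `contains_text` is a hypothetical helper, and the `min_confidence` cutoff of 0.3 is an arbitrary assumption that would need tuning on your dataset. It relies only on `readtext` returning (bounding box, text, confidence) tuples with its default arguments.

```python
def contains_text(ocr_results, min_confidence=0.3):
    """Return True if any detection holds an ASCII letter or digit.

    `ocr_results` is a list of (bounding_box, text, confidence) tuples.
    Detections below `min_confidence` are ignored to reduce false positives.
    """
    for _bbox, text, confidence in ocr_results:
        if confidence < min_confidence:
            continue
        if any(ch.isascii() and ch.isalnum() for ch in text):
            return True
    return False
```

With the script above, `contains_text(output)` would be true for the dog image, even though the text was read as "Kay 10 2003".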

You can use inRange thresholding.

The result will be:

[image: thresholded result]

If you set the page segmentation mode (psm) to 6, the output will be:

<<
‘\
' MAY 18 2003

All the digits are captured correctly, but there are some unwanted characters. If we keep only the alphanumeric characters, the result will be:

['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']

First, I upsampled the image and then applied Tesseract OCR, because the date is too small to read otherwise.

Code:

import cv2
import pytesseract
from numpy import array

img = cv2.imread("result.png")  # Load the upsampled image
img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # Convert to HSV for color-based masking
# Keep only the saturated, bright pixels
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
# Invert so the text is dark on a white background
thr = 255 - cv2.bitwise_and(dlt, msk)

txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])
cv2.imshow("", thr)
cv2.waitKey(0)

For other images, you can set new values for the lower and upper bounds of the range:

import numpy as np

min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)

You can also test with different psm parameters:

txt = pytesseract.image_to_string(thr, config='--psm 6')
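
Trying several modes can be wrapped in a small loop. This is a sketch under assumptions: `ocr_with_psm_modes` is a hypothetical helper, the choice of modes 6, 7 and 11 is just an example, and the `ocr` parameter exists only so the loop can be run without Tesseract installed (by default it uses `pytesseract.image_to_string`).

```python
def ocr_with_psm_modes(image, psm_modes=(6, 7, 11), ocr=None):
    """Run OCR once per page segmentation mode.

    Returns a dict mapping each psm value to the ASCII letters and digits
    found in that mode's output (an empty string if none were found).
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_modes:
        text = ocr(image, config=f"--psm {psm}")
        # Same 'only alphanumeric' filter as above, restricted to ASCII.
        results[psm] = "".join(
            ch for ch in text if ch.isascii() and ch.isalnum()
        )
    return results
```

An image could then be treated as containing text if `any(ocr_with_psm_modes(thr).values())` is true.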

For more, read: Improving the quality of the output