How to detect if image contains ASCII characters?

I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:

[Image: a dog, with the date stamp "MAY 18 2003" in the bottom-right corner]

As you can see, the bottom-right corner contains the text "MAY 18 2003", so this image should be filtered out.

After some research, I came across Tesseract OCR. In Python, I tried the following:

# Attempt 1
from PIL import Image
import pytesseract

img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)

# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)

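# Attempt 3
import string
# Note: char_whitelist is built here but never passed to Tesseract;
# the config below whitelists digits only.
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
# --psm 10 treats the image as a single character; --oem 3 selects the default engine
text = pytesseract.image_to_string(img, lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)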

None of them detected the string (each one prints only whitespace). How can I detect it?

You should prepare the image for OCR first.

For example, for this image I would do the following:

Convert it to a black-and-white image, using a threshold that makes the text visible (130 works for this image).
Then invert the image, so that the text is black.
Now try Tesseract OCR on the result.
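The three steps above can be sketched roughly as follows. This is a minimal sketch rather than the answerer's exact code: it assumes opencv-python and pytesseract are installed, the function name read_text_after_threshold is hypothetical, and 130 is only the threshold suggested for this particular image. cv2.THRESH_BINARY_INV performs the threshold and the inversion in a single call.

```python
def read_text_after_threshold(path, threshold=130):
    """Binarize and invert an image, then run Tesseract on the result.

    Hypothetical helper; requires opencv-python and pytesseract.
    """
    # Imported lazily so the module can be loaded without the OCR stack installed.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)

    # Pixels brighter than `threshold` become 0 (black) and all others 255
    # (white), so bright text ends up black on a white background.
    _, prepared = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY_INV)

    return pytesseract.image_to_string(prepared)
```

Other images will likely need a different threshold, so it is worth inspecting the binarized image before trusting the OCR output.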

You can use EasyOCR instead of pytesseract, which gives this output directly:

Kay 10 2003

Since your goal is only to detect whether ASCII text is present, the misread characters do not matter: you just want to filter out the images that contain text.

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import cv2
import easyocr

path = ""
img = cv2.imread(path+"input.jpg")

# Run EasyOCR with the English model
reader = easyocr.Reader(['en'])

# readtext returns a list of (bounding box, text, confidence) entries
output = reader.readtext(img)

for _bbox, text, _confidence in output:
    print(text)
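Because only the presence of text matters, the EasyOCR results can be reduced to a yes/no answer. The following is a minimal sketch, assuming the results have the default readtext shape of (bounding box, text, confidence) entries. The function name contains_text and the 0.3 confidence cutoff are hypothetical choices for illustration, not part of EasyOCR.

```python
def contains_text(results, min_confidence=0.3):
    """Return True if any OCR detection holds an ASCII letter or digit.

    `results` is a list of (bbox, text, confidence) tuples.
    `min_confidence` is an assumed cutoff to ignore very weak detections.
    """
    for _bbox, text, confidence in results:
        if confidence < min_confidence:
            continue
        # The exact characters do not matter, only that some were found.
        if any(ch.isascii() and ch.isalnum() for ch in text):
            return True
    return False


# Example with a hand-written result in the same shape as readtext output
sample = [([[0, 0], [90, 0], [90, 20], [0, 20]], "Kay 10 2003", 0.62)]
print(contains_text(sample))
```

An image for which contains_text returns True would then be dropped from the dataset.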

You can use inRange thresholding (cv2.inRange on the image in HSV color space) to isolate the text.

The result will be:

[Image: the thresholded result]

If you set the page segmentation mode (psm) to 6, the output will be:

<<
‘\
' MAY 18 2003

The date is captured correctly, but there are some unwanted characters. If we keep only the alphanumeric characters, the result will be:

['M', 'A', 'Y', '1', '8', '2', '0', '0', '3']

Note that I first upsampled the image and then applied Tesseract OCR, because the date is too small to read at the original size.

Code:

import cv2
import pytesseract
from numpy import array

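img = cv2.imread("result.png")  # Load the upsampled image
img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # inRange below works on HSV values
# Keep only strongly saturated, bright pixels (the date stamp)
msk = cv2.inRange(img, array([0, 103, 171]), array([179, 255, 255]))
# Dilate the mask slightly, then AND it with the original mask
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=1)
thr = 255 - cv2.bitwise_and(dlt, msk)  # Invert: black text on a white background

txt = pytesseract.image_to_string(thr, config='--psm 6')
print([t for t in txt if t.isalnum()])  # Keep only alphanumeric characters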
cv2.imshow("", thr)
cv2.waitKey(0)
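The code above loads an image that was already upsampled, but the upsampling step itself is not shown. One possible way to do it is sketched here. The 2x factor, the bicubic interpolation, and the function name upsample are all assumptions, not the settings used to produce result.png.

```python
def upsample(img, scale=2.0):
    """Enlarge an image (NumPy array) by `scale` with bicubic interpolation.

    Hypothetical helper; requires opencv-python.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")

    # Imported lazily so the module can be loaded without OpenCV installed.
    import cv2

    # dsize=None lets OpenCV compute the output size from fx and fy.
    return cv2.resize(img, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)


# Usage (file names are placeholders):
#   import cv2
#   cv2.imwrite("result.png", upsample(cv2.imread("input.jpg")))
```

If the text is still too small for Tesseract, a larger scale may be worth trying.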

For other images, you can set new values for the minimum and maximum HSV ranges:

import numpy as np

min_range = np.array([0, 103, 171])
max_range = np.array([179, 255, 255])
msk = cv2.inRange(img, min_range, max_range)

You can also test with different psm parameters:

txt = pytesseract.image_to_string(thr, config='--psm 6')
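Trying several page segmentation modes can be automated with a small loop. This is a sketch under assumptions: the helper name first_alnum_result is hypothetical, and the modes 6 (single uniform block), 7 (single text line) and 11 (sparse text) are just a plausible selection for a small date stamp.

```python
def first_alnum_result(image, psm_modes=(6, 7, 11)):
    """Run Tesseract once per psm mode and return (psm, chars) for the first
    mode that yields any alphanumeric characters, or (None, []) if none does.

    Hypothetical helper; requires pytesseract.
    """
    # Imported lazily so the module can be loaded without pytesseract installed.
    import pytesseract

    for psm in psm_modes:
        txt = pytesseract.image_to_string(image, config=f"--psm {psm}")
        chars = [t for t in txt if t.isalnum()]
        if chars:
            return psm, chars
    return None, []
```

For filtering a dataset, a result other than (None, []) would mark the image as containing text.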

For more, see the Tesseract documentation page "Improving the quality of the output".
