Is there any way to detect strings like putjbtghguhjjjanika?

People search in my website and some of these searches are these ones:

tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a

My question is there any way to detect strings that similar to ones above ?

I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)

edit: I mean the "gibberish searches". For example some people search strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect jibberish searches.

It doesn't matter if search result will be 0 or anything else. I can't use this logic.

Some new brands or products will be ignored if I will consider "regular words".

Thank you for your help


Solution 1:

You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.

For background, read about Markov Chains.

Edit, I implemented this here in Python:

https://github.com/rrenaud/Gibberish-Detector

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True

Solution 2:

You could do what Stackoverflow does and calculate the entropy of the string.

Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.

Solution 3:

Assuming you mean jibberish searches... It would be more trouble than it's worth. You are providing them with a search functionality, let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than just simply returning no results.