How to check if a string looks randomized, or human-generated and pronounceable?

For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof". It may be nonsense, but it still contains pronounceable sounds and so appears human-generated.

I accept that it could have been randomly generated from a dictionary of syllables, or word parts, but let's assume for a moment that the bot in question is a bit rubbish.

  1. Suppose you have a username like "sdfgbhm342r3f". To a human this is clearly a random string, but can it be identified programmatically?
  2. Are there any algorithms available (similar to Soundex, etc.) that can identify pronounceable sounds within a string like this?

Solutions applicable in PHP/MySQL most appreciated.


Solution 1:

I guess something like that could work if you restrict yourself to sounds that are pronounceable in English. For me (I am French), words like szczepan or wawrzyniec are unpronounceable and certainly look random.

But they are actually Polish first names (meaning Steven and Lawrence)...

Solution 2:

I agree with Mac. But more than that, people sometimes have usernames that aren't pronounceable, like qwerty or rtfmorleave.

Why bother with that?

< obsolete and false, but i don't delete because of comments >

But more than that, no bots use 'zetztzgsd' as a username; they have dictionaries of real names, possible nicknames, etc., so I think this would be a waste of time for you.

< / obsolete and false, but i don't delete because of comments>

Solution 3:

Look up n-gram analysis. It is successfully used to automatically detect text language and works surprisingly well even on very short texts.

The online demo (no longer online) recognized 'bilbomoothof' as English and 'sdfgbhm342r3f' as Nepali. It probably always returns the best match, even if it's a very poor one. I think you could train it to discern between 'pronounceable' and 'random'.
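
As a rough illustration of the idea (not the demo mentioned above), here is a minimal character-bigram sketch in Python. Everything in it is an assumption made for the example: the tiny `SAMPLE_TEXT` training sample, the 36-character alphabet used for smoothing, and the helper names `bigrams`, `train` and `score`. A usable model would be trained on a large word list or on known-good usernames.

```python
import math
from collections import Counter

# Tiny illustrative sample (an assumption for this sketch); a usable model
# needs a large word list, e.g. a dictionary file or known-good usernames.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog while another thing "
    "happens in the morning and people often look for a good book to "
    "read at home"
)

# Assumed alphabet for smoothing: a-z plus 0-9, so 36 * 36 possible bigrams.
POSSIBLE_BIGRAMS = 36 * 36


def bigrams(word):
    """Return the overlapping two-character slices of a lowercased word."""
    word = word.lower()
    return [word[i:i + 2] for i in range(len(word) - 1)]


def train(text):
    """Count bigrams inside each whitespace-separated word of the text."""
    counts = Counter()
    for word in text.split():
        counts.update(bigrams(word))
    return counts


def score(name, counts):
    """Average log of the add-one-smoothed bigram frequency (higher = more word-like)."""
    total = sum(counts.values())
    denom = total + POSSIBLE_BIGRAMS
    grams = bigrams(name)
    if not grams:
        # Too short to contain a bigram: fall back to the "unseen" score.
        return math.log(1 / denom)
    logs = [math.log((counts[g] + 1) / denom) for g in grams]
    return sum(logs) / len(logs)


MODEL = train(SAMPLE_TEXT)

if __name__ == "__main__":
    for name in ("bilbomoothof", "sdfgbhm342r3f"):
        print(name, round(score(name, MODEL), 3))
```

With this sample, 'bilbomoothof' scores higher than 'sdfgbhm342r3f', because none of the bigrams in the second string occur in the training words. In practice you would pick a cut-off from the scores of usernames you know to be genuine. Note that a model trained on English will also score names like szczepan poorly, so the result is a hint rather than proof. The same counting logic should port to PHP arrays, with the bigram counts kept in a MySQL table.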