English generator algorithms [closed]
This may be an odd question for this site, but tonight I've been enjoying myself by creating a small script that produces (is supposed to produce) sample sentences that resemble English, while being total gibberish.
The idea came from reading a question on StackOverflow.com which involved word wrapping of a text. Some people would use the Lorem Ipsum quote to generate a sample text for demonstration purposes. I thought, why, this would be a nice use for a random text generator.
The very intriguing Wug test was also at the back of my mind, and the fact that it is relatively easy to read a sentence with scrambled words, as long as beginning and end letters remain the same. For example:
Ocne uopn a mndihgit derary, whlie I peonredd waek and warey oevr mnay a quinat and ciruous vumole of fgtorteon lero,
I have done some research on (concerning English):
- Word length distribution (Using an approximation of Zipf's Law I found online)
- Letter distribution and first letter distribution
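For anyone who wants to reproduce the starting point, here is a minimal sketch of sampling a word length and then letters from weighted tables. The letter weights are rough, commonly quoted English percentages for a handful of letters only, and the word-length weights are made-up placeholders, not measured data. The names `LETTER_WEIGHTS`, `LENGTH_WEIGHTS`, `random_word` and `random_sentence` are my own, not from the script described above.

```python
import random

# Approximate English letter frequencies (percent) for a few common letters.
# Truncated on purpose; a real table would cover the whole alphabet.
LETTER_WEIGHTS = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
    "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "u": 2.8,
}

# Placeholder word-length weights (assumed values, for illustration only).
LENGTH_WEIGHTS = {1: 3, 2: 17, 3: 21, 4: 16, 5: 11, 6: 9, 7: 8, 8: 6, 9: 4, 10: 3}


def weighted_choice(weights, rng):
    """Pick one key of `weights` with probability proportional to its value."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def random_word(rng):
    """Draw a length, then draw that many letters independently."""
    length = weighted_choice(LENGTH_WEIGHTS, rng)
    return "".join(weighted_choice(LETTER_WEIGHTS, rng) for _ in range(length))


def random_sentence(rng, n_words=8):
    """Join random words, capitalize the first letter, end with a period."""
    words = [random_word(rng) for _ in range(n_words)]
    return " ".join(words).capitalize() + "."


if __name__ == "__main__":
    print(random_sentence(random.Random()))
```

Because every letter is drawn independently, this gets the letter mix roughly right but knows nothing about which letters may sit next to each other, which is exactly the weakness visible in the sample text.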
Adding some random punctuation and capitalization, it is looking pretty, but I need some simple algorithms to make the words more realistic looking. Here's a sample text:
Ynssdto lcianche ttlkise aaricod oawsepje. Hast tvnvcfaiesont eteoy prae wwecofuothenroo nmtnhglw lmhwefc etlugloe. Ywio odhw, chlt dhpei tiaqirter, sorrdstg aontli kayhut, tnust, berv dosp wrhhys sblfm. Nkttrbfoeret thpit atea aoecwb ctwrhfae oneeot selm teihug ttolgktrwwmc, wwrleil sga, isdeedeo adnrsi, aydhd asroino dhddonn, lrctp gckort ikhcvo. Tvte hzmdosnd wsad a cwfndoac drnsrtsaths
Obviously, words should contain at least one vowel. It might in fact be a good idea to make vowel insertion a distinct part of the process. Some consonants should not follow each other (e.g. tvnvcf), and there should not be too many in a row.
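The two rules just mentioned (at least one vowel, not too many consonants in a row) could be checked with a simple filter along these lines. This is only a sketch: counting "y" as a vowel and allowing at most three consonants in a row are my assumptions, and the function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: treat "y" as a vowel


def longest_consonant_run(word):
    """Return the length of the longest run of consecutive consonant letters."""
    longest = current = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_plausible(word, max_run=3):
    """True if the word has a vowel and no consonant run longer than max_run."""
    has_vowel = any(ch in VOWELS for ch in word.lower())
    return has_vowel and longest_consonant_run(word) <= max_run


if __name__ == "__main__":
    for w in ["hast", "tvnvcfaiesont", "lmhwefc", "prae"]:
        print(w, looks_plausible(w))
```

A generator could simply re-roll any word that fails the check. Note that a cap of three is stricter than real English ("strengths" has a run of five consonant letters), so the limit is a tuning knob rather than a rule of the language.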
I was looking for a distribution of the last letters in English words, but that may not be applicable, since word endings can be fairly similar (ing, ane, tion, able, etc.), and that might add some familiarity to the sentences.
I'm looking for ideas. Links to resources. Rules of thumb. What can I do to make my script spout more legible gibberish?
In short, what are the general rules for building an English-looking word?
If you pick up an intro to linguistics text, it'll have something for you (like "Relevant Linguistics" by Paul Justice). I know that RL deals with this problem specifically. The key addition from linguistics would be that the way we produce sounds physically affects what kinds of words can be "believable" or even "conceivable."
For example, in your random text, there's a "word" called "Ynssdto." Let's make the Y sound like a short I (like "in") and call the double S's a single S sound (like "guess"). That brings us to an odd combination of what we call "alveolar-dental plosives" (if my terminology isn't too rusty). ADP's are "explosive"-type sounds (they make a puff of air) produced by placing our tongue where our teeth meet the roof of our mouths. This combination of sounds is not possible in English, and I would wager in any language. You'd need a vowel BETWEEN those two sounds. Like in "tada!"
I know nothing of programming, but here's what I think could solve the problem. First, classify letters by their manner of production, then assign rules governing their distribution in the words. One rule might be that "dental plosives cannot follow one another in the same word."
Or "no interdental fricatives can follow one another in the same word" (IDF's are the "th" sounds in "this" and "thin". Fricatives produce sound by buzzing...think "friction"..."sh" "z" and "s" are all fricatives). I bet that no two fricatives of any sort can follow each other (like "th" + "sh" + "z").
Or "two stops cannot follow each other" like in "gckort." [g] and [k] are both "velar stops," made by stopping the flow of air at the back of the mouth for a moment. Similarly, a velar stop could not combine with an "alveolar stop" like [t] without a vowel in between. Gt? No. Git? OK.
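For the programmers reading along, the "classify, then write rules" idea might be sketched like this. Big caveats: it works on letters as crude stand-ins for sounds, the class table is deliberately incomplete, collapsing "ck" to a single "k" is my own shortcut, and the single rule shown is the strict "no two stops in a row" version from above, which is tighter than real English (it would reject "act", for instance). All names here are hypothetical.

```python
# Rough letter-to-manner table; letters not listed fall into "other".
MANNER = {}
for letters, label in (
    ("pbtdkg", "stop"),
    ("fvsz", "fricative"),
    ("mn", "nasal"),
    ("lr", "liquid"),
    ("aeiou", "vowel"),
):
    for ch in letters:
        MANNER[ch] = label

# Pairs of manner classes that may not be adjacent (assumed rule set).
FORBIDDEN_PAIRS = {("stop", "stop")}


def violations(word, forbidden=FORBIDDEN_PAIRS):
    """Return the adjacent letter pairs whose classes form a forbidden pair."""
    # Collapse the digraph "ck" so it counts as one sound, not two stops.
    sounds = word.lower().replace("ck", "k")
    classes = [MANNER.get(ch, "other") for ch in sounds]
    return [
        sounds[i:i + 2]
        for i in range(len(sounds) - 1)
        if (classes[i], classes[i + 1]) in forbidden
    ]


if __name__ == "__main__":
    for w in ["gt", "git", "gckort", "Ynssdto"]:
        print(w, violations(w))
```

More rules can be added by putting more class pairs into `FORBIDDEN_PAIRS`. A serious version would map spellings to phonemes first, since digraphs like "th" and "sh" are single sounds.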
Some good news: linguists have already classified manner of production for all phonemes (sounds) in all languages. RL actually gives a short set of rules for combining phonemes, and some nonsense words to demonstrate how these rules work! This will be a BIG step in the right direction.
BTW RL is a user-friendly text that should be very accessible, but for that exhaustive list of phoneme production location, you might need to grab a more detailed text.
Good luck! I think your project sounds really cool!
I once wrote a program (long lost now, sorry) which processed a corpus of text, recording the frequency of each character given the previous sequence of n characters. So, for example, for the word "hello" and n=3, it would do:
- frequency[null,null,null,h] ++
- frequency[null,null,h,e] ++
- frequency[null,h,e,l] ++
- frequency[h,e,l,l] ++
- frequency[e,l,l,o] ++
- frequency[l,l,o,end] ++
Then it would generate words by starting with a random letter, then picking the next letter using the weightings obtained from the corpus, until it picks an 'end'.
For too low values of n, you get unpronounceable words.
For too high values of n, you tend to get mostly real words.
If you tune n just right, you get a good selection of novel, pronounceable, English-looking words (or whatever language corpus you fed in). It's quite fun seeing what difference the corpus makes. The works of Shakespeare generate qualitatively different words from the Bible or a plain Scrabble word list, for example.
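Since the original program is lost, here is a rough reconstruction of the idea in Python, not the original code. The `START` and `END` markers play the role of "null" and "end" in the list above, and the names `train` and `generate` are mine. One small difference from the description: the first letter is also drawn from the table (using the all-null context) rather than picked uniformly at random.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # multi-character markers cannot clash with a letter


def train(words, n=3):
    """Count how often each character follows each n-character context."""
    freq = defaultdict(Counter)
    for word in words:
        context = (START,) * n
        for ch in list(word) + [END]:
            freq[context][ch] += 1
            context = context[1:] + (ch,)  # slide the window by one character
    return freq


def generate(freq, n=3, rng=random, max_len=20):
    """Build a word letter by letter, weighted by the counts, until END."""
    context = (START,) * n
    out = []
    while len(out) < max_len:
        counts = freq[context]
        letters = list(counts)
        ch = rng.choices(letters, weights=[counts[c] for c in letters])[0]
        if ch == END:
            break
        out.append(ch)
        context = context[1:] + (ch,)
    return "".join(out)


if __name__ == "__main__":
    corpus = ["hello", "help", "hollow", "yellow", "mellow", "fellow", "shell"]
    model = train(corpus, n=2)
    print([generate(model, n=2) for _ in range(5)])
```

The same `n` must be used for training and generation, and the corpus must be non-empty. `max_len` is only a safety cap I added so that a loop in the table cannot produce an endless word.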
I think this could be extended to sentences. One simple thing to try would be to treat the space character as just another letter. You could go one better and adjust the window size n depending on whether you're on a word boundary etc.
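The "space is just another letter" variant could look something like the sketch below: the whole text is one long string, and the context is simply the previous n characters, spaces and punctuation included. Padding the start with n spaces and stopping at a dead end are my own choices, and `train_text` and `babble` are hypothetical names. The fancier idea of varying n at word boundaries is not attempted here.

```python
import random
from collections import Counter, defaultdict


def train_text(text, n=4):
    """Count which character follows each n-character window of the text."""
    freq = defaultdict(Counter)
    padded = " " * n + text  # pad so the first real character has a context
    for i in range(n, len(padded)):
        freq[padded[i - n:i]][padded[i]] += 1
    return freq


def babble(freq, n=4, length=80, rng=random):
    """Generate up to `length` characters, treating spaces like any letter."""
    out = " " * n
    for _ in range(length):
        counts = freq.get(out[-n:])
        if not counts:  # dead end: this context was never followed by anything
            break
        chars = list(counts)
        out += rng.choices(chars, weights=[counts[c] for c in chars])[0]
    return out[n:]


if __name__ == "__main__":
    sample = "the cat sat on the mat. the rat sat on the cat. "
    print(babble(train_text(sample, n=3), n=3, length=60))
```

With a toy corpus like the one in the demo the output mostly reshuffles the input; it takes a much larger text before the sentences start to look novel.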
You could also try to classify your nonsense words into parts of speech based on some statistical heuristics (Not trivial I guess. Train a neural net to do it!)