Algorithms to identify Markov generated content?

Solution 1:

One simple approach would be to have a large group of humans read the input text for you and see if it makes sense. I'm only half-joking; this is a tricky problem.

I believe this to be a hard problem, because Markov-chain-generated text will have a lot of the same properties as real human text in terms of word frequencies and the simple relationships between the ordering of words.

The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.

As an example, here's an aphorism from the kantmachine:

Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.

While this string was written by a computer program, it's hard to say that a human would never say this.

I think that unless you can give us more specific details about the computer- and human-generated text that expose more obvious differences, it will be difficult to solve this with a program.

Solution 2:

You could use a "brute force" approach, whereby you compare the generated language to data collected on n-grams of higher order than the Markov model that generated it.

I.e., if the language was generated with a 2nd-order Markov model, n-grams up to 3-grams will have the correct frequencies, but 4-grams probably won't.
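
As a rough sketch of that comparison (the whitespace tokenisation, the n=4 choice and the reference corpus are all placeholders -- any large sample of known-human text would do):

    def ngrams(tokens, n):
        """Successive n-grams (as tuples) from a list of tokens."""
        return list(zip(*(tokens[i:] for i in range(n))))

    def novel_ngram_rate(suspect_tokens, reference_tokens, n=4):
        """Fraction of n-grams in the suspect text that never occur in the
        reference corpus.  A 2nd-order Markov generator reproduces plausible
        trigrams but tends to stitch them into 4-grams the reference has
        never seen, so an unusually high rate is suspicious."""
        reference = set(ngrams(reference_tokens, n))
        suspect = ngrams(suspect_tokens, n)
        if not suspect:
            return 0.0
        return sum(1 for g in suspect if g not in reference) / len(suspect)

    # Hypothetical usage (reference_text / suspect_text are plain strings):
    # rate = novel_ngram_rate(suspect_text.split(), reference_text.split())
    # print("%.1f%% of 4-grams never seen in the reference" % (100 * rate))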

You can get up to 5-gram frequencies from Google's public n-gram dataset. It's huge, though - 24 GB compressed - and you have to get it by post on DVDs from the LDC.

EDIT: Added some implementation details

The n-grams have already been counted, so you just need to store the counts (or frequencies) in a way that's quick to search. A properly indexed database, or perhaps a Lucene index, should work.

Given a piece of text, scan across it and look up the frequency of each 5-gram in your database, and see where it ranks compared to other 5-grams that start with the same 4 words.
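
For instance, if the counts were loaded into a SQLite table (the ngrams(w1, w2, w3, w4, w5, count) schema below is purely hypothetical; any indexed store would do), the scan-and-rank step might look roughly like this:

    import sqlite3

    def prefix_rank(conn, five_gram):
        """Rank of a 5-gram's count among all 5-grams sharing its first four
        words, in a hypothetical table ngrams(w1, w2, w3, w4, w5, count)
        indexed on (w1, w2, w3, w4).  Returns (rank, total), or (None, total)
        if the 5-gram was never counted at all."""
        w1, w2, w3, w4, w5 = five_gram
        rows = conn.execute(
            "SELECT w5 FROM ngrams "
            "WHERE w1=? AND w2=? AND w3=? AND w4=? "
            "ORDER BY count DESC",
            (w1, w2, w3, w4),
        ).fetchall()
        for rank, (word,) in enumerate(rows, start=1):
            if word == w5:
                return rank, len(rows)
        return None, len(rows)

    def scan(conn, tokens):
        """Collect the rank of every 5-gram in the text; lots of missing or
        very low-ranked 5-grams point towards synthetic text."""
        return [prefix_rank(conn, tuple(tokens[i:i + 5]))
                for i in range(len(tokens) - 4)]

    # Hypothetical usage:
    # conn = sqlite3.connect("five_grams.db")
    # ranks = scan(conn, text.split())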

Practically, a bigger obstacle might be the licensing terms of the dataset. Using it for a commercial app might be prohibited.

Solution 3:

I suggest a generalization of Evan's answer: build a Markov model of your own and train it on a big chunk of the (very large) sample you're given, reserving the rest of the sample as "test data". Now see how well the trained model does on the test data, e.g. with a chi-square test that will flag situations in which the "fit is TOO good" (suggesting the test data really was generated by this model) as well as ones in which the fit is very bad (suggesting an error in the model structure -- an over-trained model with the wrong structure does a notoriously bad job in such cases).
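
Here's a minimal sketch of that train/hold-out idea, assuming a 1st-order (bigram) model and a hand-rolled Pearson chi-square statistic -- both are just illustrative choices, not the only reasonable ones:

    from collections import Counter, defaultdict

    def train_bigram(tokens):
        """Conditional probabilities P(next | current) from the training split."""
        pair_counts = defaultdict(Counter)
        for cur, nxt in zip(tokens, tokens[1:]):
            pair_counts[cur][nxt] += 1
        model = {}
        for cur, nxts in pair_counts.items():
            total = sum(nxts.values())
            model[cur] = {nxt: c / total for nxt, c in nxts.items()}
        return model

    def chi_square(model, test_tokens, min_expected=5.0):
        """Pearson chi-square of observed vs. expected bigram counts on the
        held-out split.  A statistic far below the degrees of freedom means
        the fit is suspiciously good (the text may come from a model of this
        very structure); a huge statistic means the structure is wrong."""
        observed = defaultdict(Counter)
        context_totals = Counter()
        for cur, nxt in zip(test_tokens, test_tokens[1:]):
            observed[cur][nxt] += 1
            context_totals[cur] += 1
        stat, dof = 0.0, 0
        for cur, probs in model.items():
            for nxt, p in probs.items():
                expected = context_totals[cur] * p
                if expected >= min_expected:   # skip sparse cells
                    stat += (observed[cur][nxt] - expected) ** 2 / expected
                    dof += 1
        return stat, dof

    # Hypothetical usage: train on 80% of the sample, test on the rest.
    # split = int(0.8 * len(sample_tokens))
    # model = train_bigram(sample_tokens[:split])
    # stat, dof = chi_square(model, sample_tokens[split:])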

Of course there are still many issues for calibration, such as the structure of the model: do you suspect a simple model based on N-tuples of words and little more, or a more sophisticated one with grammar states and the like? Fortunately you can calibrate things pretty well by using large corpora of known-to-be-natural text, as well as corpora you generate yourself with models of various structures.

A different approach is to use nltk to parse the sentences you're given. A small number of mis-parses is to be expected even in natural text (humans are imperfect, and so is the parser -- it may not know that word X can be used as a verb and only classify it as a noun, etc.), but most Markov models (unless they're modeling essentially the same grammar structure your parser happens to be using -- and you can use several parsers to try and counteract that!-) will cause VASTLY more mis-parses than even dyslexic humans. Again, calibrate that on natural vs. synthetic texts, and you'll see what I mean!-)
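
Purely to illustrate the bookkeeping, here's a sketch with nltk's chart parser and a deliberately tiny hand-written grammar; for real text you'd swap in a broad-coverage parser (or several, as suggested above), and calibrate the failure rate on known-natural and known-synthetic samples:

    import nltk

    # Toy grammar -- only here to make the example self-contained.
    GRAMMAR = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N | Det Adj N | 'freedom' | 'nature'
        VP  -> V NP | V
        Det -> 'the' | 'a'
        Adj -> 'human' | 'mere'
        N   -> 'will' | 'illusion' | 'chain'
        V   -> 'is' | 'declares'
    """)
    PARSER = nltk.ChartParser(GRAMMAR)

    def misparse_rate(sentences):
        """Fraction of tokenised sentences that get no parse tree at all.
        Natural text should fail occasionally; Markov-generated text should
        fail far more often -- calibrate both on known samples."""
        failures = 0
        for tokens in sentences:
            try:
                if next(iter(PARSER.parse(tokens)), None) is None:
                    failures += 1
            except ValueError:   # a token the grammar doesn't cover at all
                failures += 1
        return failures / len(sentences) if sentences else 0.0

    # Hypothetical usage:
    # print(misparse_rate([['the', 'will', 'is', 'a', 'mere', 'illusion']]))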