Most common parses of the English language?
I hope I've got the right forum. I want to know about English specifically, although this is a linguistics question.
A common task in NLP and computational linguistics is to generate parse trees for sentences. How a sentence is parsed depends, for one thing, on POS tagging, which in turn depends on how the parts of speech of the language are enumerated.
What I am interested in is a sort of reverse view of the process. Just as I can ask what the most common words in English are, I wish to ask what the most common parses in English are, if this can in fact be measured or approximated. I am interested in the fully expanded representations, not something as basic as NP VP. For example, one might render:
The bat eats a cat
d n v d n
using one simplistic POS enumeration. Looking at this parse, I would ask what percentage of sentences in English follow this exact pattern?
I would utilize projects like
- http://books.google.com/ngrams/info/
- http://nltk.org
- google-ngram-stripper
plus some programming skills to get such statistics.
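To make the kind of statistic I mean concrete, here is a minimal sketch of the counting step in Python. It assumes the sentences are already POS-tagged and only counts flat tag sequences like the one above, not full parse trees. The function name `pattern_frequencies` and the tiny hand-tagged `sample` are made up for illustration; the d/n/v tags are the simplistic ones from my example.

```python
from collections import Counter


def pattern_frequencies(tagged_sentences):
    """Count how often each exact POS-tag sequence occurs.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns a list of (pattern, count, share) tuples, most common first.
    """
    counts = Counter(
        tuple(tag for _word, tag in sentence) for sentence in tagged_sentences
    )
    total = sum(counts.values())
    return [(pattern, n, n / total) for pattern, n in counts.most_common()]


# Hypothetical hand-tagged sample, not real corpus data.
sample = [
    [("The", "d"), ("bat", "n"), ("eats", "v"), ("a", "d"), ("cat", "n")],
    [("A", "d"), ("dog", "n"), ("sees", "v"), ("the", "d"), ("bird", "n")],
    [("Birds", "n"), ("sing", "v")],
]

for pattern, n, share in pattern_frequencies(sample):
    print(" ".join(pattern), n, f"{share:.1%}")
```

As far as I can tell, a tagged corpus from NLTK (for example `nltk.corpus.brown.tagged_sents(tagset="universal")`, after downloading the corpus data) yields sentences in the same list-of-(word, tag) shape, so it could presumably be passed in directly. The resulting percentages would depend heavily on the corpus and on the tag set chosen.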