Is there an Extended Backus–Naur form (EBNF) that covers all of English?
Is there an EBNF (Extended Backus–Naur form) that covers all of English, and if so, what is it?
No.
It's been well demonstrated in the linguistic literature that natural human languages, including English, cannot be captured in a context-free grammar.
Here's a link for you (PDF): Evidence against the context-freeness of natural language
The answer is best approximated as "yes", although there are some strictly non-context-free components of English. The approximation "English grammar is context free" is more true than false: the vast majority of the sentences you will encounter will be parseable by a simple EBNF, and all of them will be parseable by a mild extension of EBNF that I will describe below, which does not require anything more than a stack plus some list-variables to keep track of permutable objects.
The permutable list variables you have to keep track of in the parsing algorithm are (probably) not strictly context free constructions, but they are not deep modifications of the context-free grammar idea, and there will be a preferred ordering of the list which is the one most commonly occurring in the New York Times, which is strictly context free.
I did not thoroughly bug-check the grammar, but it is mostly okay.
The Generating Problem and the Parsing Problem
There are two different problems in a formal grammar, the problem of generating all outputs and the problem of parsing a given output. To illustrate, consider the following BNF:
SENTENCE: ALIST | BLIST
ALIST: "a" ALIST | "a" BLIST | ""
BLIST: "b" BLIST | "b" ALIST | "ba" ALIST | ""
The very simple rules for generating the productions of the grammar are given above. The vertical line is read "or", and it separates options for expanding the objects in capitals. The stuff in quotes are actual letters, and the empty string "" means that you produce nothing. To start, you have one symbol
SENTENCE
Then you replace SENTENCE with one of the options to the right of the colon. For illustration, take ALIST
ALIST
Now you replace ALIST with one of its options (removing the quote marks),
a BLIST
Now replace BLIST with one of its options, and so on, until you run out of capital letter things. The capital letter things are called "nonterminals", and the lowercase things are called "terminals" or "leaves". The result in this case is a string of a's and b's:
ababbabbab
Each finished product is a sentence of the language. Since I have given the rules explicitly, all productions are generatable by a simple algorithm.
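The generation algorithm just described can be sketched in a few lines. This is a toy illustration of my own, not part of the original answer: the grammar is encoded as a dictionary, and nonterminals are expanded by random choice (the names `GRAMMAR` and `generate` are invented).

```python
import random

# Toy encoding of the ALIST/BLIST grammar above; keys are
# nonterminals, anything else is a terminal emitted as-is.
GRAMMAR = {
    "SENTENCE": [["ALIST"], ["BLIST"]],
    "ALIST": [["a", "ALIST"], ["a", "BLIST"], []],
    "BLIST": [["b", "BLIST"], ["b", "ALIST"], ["ba", "ALIST"], []],
}

def generate(symbol):
    """Expand a symbol into a string by picking rules at random."""
    if symbol not in GRAMMAR:          # terminal: emit it directly
        return symbol
    rule = random.choice(GRAMMAR[symbol])
    return "".join(generate(s) for s in rule)

print(generate("SENTENCE"))  # e.g. "abba" (random each run)
```

Every run produces one finished sentence of the language, which is what "all productions are generatable by a simple algorithm" means concretely.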
The problem of parsing is going backwards--- given a sentence, figure out which rules were used to construct it. The problem of parsing is harder than the problem of generation, because it is usually ambiguous. The sentence
abaa
could have come from a-ba-a using the "ba" ALIST option of BLIST, or from a-b-a-a using the "b" ALIST option. This is the main difficulty in parsing, and different presentations of the same language sometimes have different amounts of ambiguity. For special grammars, there is no ambiguity, and these are the kind most often used in computer languages. The C programming language is presented in an LALR (look-ahead, left-to-right) BNF that can be automatically parsed by the UNIX classics lex and yacc (or their free software versions, flex and bison).
The main type of grammars that people study are the context-free grammars, where the stuff on the left is just one symbol in capital letters, one nonterminal. General grammars are defined as the output of any conceivable computer program, and so are too general to be useful. There are also other classes, which I will completely ignore, because the context free class is the closest one to linguistics.
The prototypical example of a nontrivial context free grammar is parentheses matching of two different kinds of parentheses:
SENTENCE: S
S: "(" S ")" | "[" S "]"
This grammar generates all sentences which are properly balanced nested parentheses of two kinds, square or round. A more interesting example is
SENTENCE: EXPR
EXPR : EXPR "+" EXPR
| EXPR "*" EXPR
| "(" EXPR ")"
| NUMBER
NUMBER : "0" | NONZERODIGIT DIGITLIST
NONZERODIGIT: "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
DIGIT: "0" | NONZERODIGIT
DIGITLIST: DIGIT DIGITLIST | ""
You read this left to right as follows: "a SENTENCE is an EXPR", "An EXPR is an EXPR plus an EXPR or an EXPR times an EXPR, or an EXPR in parentheses, or a NUMBER", etc. This grammar generates every arithmetic expression you use in grammar school. The C programming language is generated by a simple context free grammar, as are nearly all programming languages in use today.
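As written, the EXPR grammar is ambiguous (EXPR "+" EXPR can group either way), so hand-written parsers usually refactor it into precedence layers first. Below is a minimal recursive-descent sketch of that refactored form; this is my own illustration, and all function names are invented.

```python
# Minimal recursive-descent evaluator for the arithmetic grammar above,
# refactored into unambiguous precedence layers:
#   EXPR   : TERM   ("+" TERM)*
#   TERM   : FACTOR ("*" FACTOR)*
#   FACTOR : "(" EXPR ")" | NUMBER
import re

def evaluate(text):
    tokens = re.findall(r"\d+|[+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok!r}, got {peek()!r}"
        pos += 1

    def expr():
        value = term()
        while peek() == "+":
            eat("+")
            value += term()
        return value

    def term():
        value = factor()
        while peek() == "*":
            eat("*")
            value *= factor()
        return value

    def factor():
        nonlocal pos
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        value = int(peek())   # NUMBER leaf
        pos += 1
        return value

    return expr()

print(evaluate("2+3*(4+5)"))  # 29
```

Each function corresponds to one nonterminal, which is exactly how yacc-style tools mechanize the same grammar.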
The context free grammars are distinguished from the regular grammars, which have at most one nonterminal on the left and on the right. The regular grammar describes a finite state language, which can be "lexed" by a "lexer" rather than "parsed" by a "parser". An example of a lexed regular language is the NUMBER grammar in the example above. You only need to remember a finite number of things in the past to determine if a number is well formed, and you never need to look at a far away place, as you do when you are trying to check if parentheses are properly balanced.
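Since the NUMBER sublanguage is regular, a single regular expression can "lex" it with no stack at all. A quick check (my own sketch):

```python
import re

# The NUMBER grammar above as a regular expression:
# either "0", or a nonzero digit followed by any run of digits.
NUMBER = re.compile(r"0|[1-9][0-9]*")

def is_number(s):
    return NUMBER.fullmatch(s) is not None

print(is_number("0"), is_number("907"), is_number("007"))  # True True False
```

The finite-state character of the language is visible here: the regex engine only ever remembers which state it is in, never an unbounded history.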
This stuff is standard, and is discussed in detail on Wikipedia and elsewhere. I only give a quick review to establish some examples, and for completeness' sake. It is related to a fascinating theory of finite state machines and stack machines, which is good to understand.
The EBNF
A BNF is a set of rules for rewriting a nonterminal in terms of other nonterminals. I will give a trivial linguistic example.
SENTENCE: STATEMENT | QUESTION | COMMAND
STATEMENT: SUBJECT VERB ADVERBLIST OBJECT "."
QUESTION: "Hey," DO SUBJECT VERB ADVERBLIST OBJECT "?"
| DO SUBJECT VERB ADVERBLIST OBJECT "?"
| "Hey, dood, " DO SUBJECT VERB ADVERBLIST OBJECT "?"
COMMAND: VERB ADVERBLIST | VERB ADVERBLIST OBJECT
ADVERBLIST: ADVERB ADVERBLIST | ""
ADVERB: "quickly" | "slowly"
DO: "do"
VERB: "run" | "walk" | "hop"
OBJECT: "here" | "there" | "everywhere"
SUBJECT: "we" | "they" | "the bunnies"
This grammar is kind-of stupid (it is also regular, although the presentation above does not make it obvious at first glance). It generates a ridiculously tiny subset of English consisting of sentences, questions, and commands about bunnies and plural people who go places by various means of locomotion:
"hey, dood, do the bunnies run quickly quickly slowly here?"
"we hop slowly quickly here?"
The sentences are dumb. The only recursive part of the grammar is the part that generates the adverb list. This pattern, making a list, occurs very often:
ADVERBLIST : ADVERB ADVERBLIST | ""
I will abbreviate this nonterminal "ADVERBLIST" by calling it "ADVERB[]", with the brackets meaning a list of zero or more adverbs, and omitting the rule which forms it. This means I am using an extension of BNF, so it is EBNF (extended BNF).
Further, there are lots of cut-and-paste operations in the BNF for QUESTION above, corresponding to a certain thing being either present in one form or another. I will abbreviate options using a nested parentheses language inside the BNF itself:
QUESTION : ("hey" | "hey, dood," | "") DO SUBJECT ADVERB[] VERB "?"
Hopefully this is self explanatory. This means that one BNF line can actually stand for very many replacement rules, since
QUESTION: ("hey" | "hey, dood," | "") DO ("" | "tell if" | "tell me if") SUBJECT ADVERB[] VERB, ("yo?" | "huh?" | "huh, yo?" | "?")
generates 36 different sentences.
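The count of 36 is just the product of the option-group sizes (3 × 3 × 4). A quick sanity check (my own toy sketch, with invented variable names):

```python
from itertools import product

# The three option groups in the QUESTION rule above.
greetings = ["hey", "hey, dood,", ""]
tells = ["", "tell if", "tell me if"]
tags = ["yo?", "huh?", "huh, yo?", "?"]

# Each choice of one option per group is a distinct replacement rule.
rules = list(product(greetings, tells, tags))
print(len(rules))  # 36
```

This is all the parenthesized-options notation is doing: one EBNF line standing for the Cartesian product of its option groups.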
Further, I will use the standard EBNF abbreviation of square brackets for an optional construction, [X] is the same as (X | ""). These types of EBNF are no good for lex and yacc, but you can manually expand them out to a real BNF, so they are still context free constructions.
QUESTION: ["hey" [", dood"]] DO [ "tell" ["me"] "if" ] SUBJECT ADVERB[] VERB ["huh"] ["yo"] "?"
The above is equivalent to the previous, but more compact.
Commutative extended BNF
The main problem with a BNF description is that it generates a completely different structure for the sentences
"to the store I went"
and
"I went to the store"
when their semantics are identical. The first is just a poetic rearrangement of the second, and a speaker will understand it just fine. But the BNF will glob together the "to the store" into a structure, which I will call an "ADVERB" (I really don't care what other people call it) because it attaches to "the store". The problem is that
ADVERB I went
and
I went ADVERB
attach in different ways in the formal grammar, because you need to go through the "I" in the first case. Further, the following production
I to the store went
is also marginally poetically acceptable. If you think this sounds forced, consider
"Without a fork, I, by a stream and on a chair, ate the chicken"
This is perfectly fine, and equivalent to
"I ate the chicken by a stream and on a chair without a fork."
In a normal BNF or EBNF description, the different ADVERBs in the sentence attach to the VERB in different ways depending on their relation to the verb, whether to the left or to the right, and even worse, in different ways depending on their relation to the subject, whether they are to the left or the right of the subject.
To fix this, I will leave the class of context free languages (maybe, perhaps this doesn't leave the class, I didn't prove it yet one way or the other, and it really doesn't matter for the practical purposes of linguistics). I will introduce another extension to the BNF, namely a "+" sign. A nonterminal with a plus sign attached can freely commute with all nonterminals not enclosed in parentheses by moving to the right. So during productions, it can slide around parentheses to the right as many times as it likes:
STATEMENT: ADVERB+[] SUBJECT (VERB OBJECT)
means that there is a list of ADVERB+ objects, which can slide past SUBJECT, or past the unit (VERB OBJECT). This construction is a pain in the neck to code in general; it isn't covered by lex and yacc. There might be a simple way to replace it with different rules, for example
STATEMENT: ADVERB[] SUBJECT ADVERB[] (VERB OBJECT) ADVERB[]
But this sort of thing leads to a proliferation of different kinds of nodes (since each different kind of merge operation of two things into one in a basic description produces a different kind of node), and the proliferation of node types (when their semantics is obviously the same) is the bugaboo of natural language generation. I will not put things in standard EBNF form, and I will use the "+" construction above.
While you can't feed this into yacc, it is still not too hard to code a parser by hand for commutative EBNFs (at least the ones that come up in English). Perhaps somebody will write an automated parser for commutative BNFs, but if the grammar is simple enough, a hand coding might be sufficient.
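One simple hand-coding strategy for the "+" construction: since ADVERB+ items commute, skim them off into a side list wherever they occur, then match the leftover tokens against the fixed skeleton. A toy sketch under those assumptions (all names and word lists are invented; for simplicity it does not enforce that nothing slides between VERB and OBJECT):

```python
# Toy parser for a commutative rule like
#   STATEMENT: ADVERB+[] SUBJECT (VERB OBJECT)
# Adverbs are collected into a list wherever they appear; the
# remaining tokens must then match SUBJECT VERB OBJECT in order.
ADVERBS = {"quickly", "slowly", "happily"}
SUBJECTS = {"we", "they"}
VERBS = {"run", "walk", "hop"}
OBJECTS = {"here", "there", "everywhere"}

def parse_statement(tokens):
    adverbs = [t for t in tokens if t in ADVERBS]
    rest = [t for t in tokens if t not in ADVERBS]
    if (len(rest) == 3 and rest[0] in SUBJECTS
            and rest[1] in VERBS and rest[2] in OBJECTS):
        return {"subject": rest[0], "verb": rest[1],
                "object": rest[2], "adverbs": adverbs}
    return None

print(parse_statement("quickly we slowly run there".split()))
# {'subject': 'we', 'verb': 'run', 'object': 'there',
#  'adverbs': ['quickly', 'slowly']}
```

The list-variable here is exactly the "stack plus some list-variables" mentioned at the top: the permutable items are accumulated out-of-band, and attached at the end.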
Nonobvious context-freeness of English
When you have a complex sentence, there are adjective-like and adverb-like phrases that can occur as embedded phrases:
"With a sword, tall and gallant, on a rock, breathless and encumbered, by a stream, over the field, armored in chains, without mercy, John slew the Jabberwock!" or
"Tall and gallant, by a stream, John, on a rock, breathless and encumbered, over the field, slew the Jabberwock, (John was) armored in chains, without mercy!"
I placed "(John was)" to indicate that the adjective-like phrase should not attach to "Jabberwock", which is the preferred parsing of native speakers. The attachment to John is also an acceptable parsing, but less salient.
The adjective-like phrases "tall and gallant" and "armored in chains" occur with no particular order preference in the list of adjective and adverb phrases. This means that as you parse the sentence, you keep track of the adjective list (each of which might recurse down several times) and the adverb list (each of these might recurse), and when you are done, you attach the adjectives to the subject, and the adverbs to the verb. The verb object is the exception--- it is tightly attached to the verb. You can't move this guy around, or stuff things between it and the verb.
I should point out that in terms of style, you are generally supposed to do it like this:
ADVERB, ADVERB ... ADVERB, ADJECTIVE, ADJECTIVE ..., ADJECTIVE SUBJECT ADJECTIVE ADJECTIVE VERB ADVERB, ADVERB...
So that it will appear:
"on the stream, by the river, without mercy, tall and gallant John, breathless and beaming, armored in chains, slew the Jabberwock over the field."
In this order, you can make it context free for sure, because you can glob up the initial adverbs and the final adverbs first, then the middle adjectives to the subject, then the adverbs, subject, and verb into a sentence. I don't like this, because to a speaker all permutations are semantically equivalent, since they attach the same things to the same things.
A subtle point appears here: while you are allowed to say
The man with a limp shot the deer.
If you say,
with a limp, the man shot the deer
the semantics changes--- the phrase becomes an adverb for sure. In the original position, right after "the man", it is more likely that you would bind "with a limp" as an adjective to "the man" than as an adverb to "shot the deer".
Recursive Skeleton English
The main commutative EBNF rules are:
SENTENCE: STATEMENT | QUESTION | COMMAND
STATEMENT: ADJ+[] ADVERB+[] (SUBJECT) ADJECTIVE+[] (VERB)
QUESTION: ADJ+[] ADVERB+[] (DO SUBJECT) ADJECTIVE+[] (VERB)
COMMAND: ADJ+[] ADVERB+[] (VERB)
ADVERB: ADV
| WHEN (STATEMENT)
| IF (STATEMENT)
| GERUND
| PREPLY NP
ADJECTIVE: GERUND | PREPISH NP | ADJ
WHEN: "when" | "whereby" | "until" | "unless"
IF: ["not"] ("if" | "only if" | "because")
AND: "and" | "but" | "or"
PREPLY: ["not"] ("to"|"onto"|"into"|"of"|"out of"|"in"|"within"|"by"|"with"|"without")
PREPISH: ["not"] ("to"|"of"|"in"|"by"|"with"|"without")
ADJ: ADJ ("and"|",") ADJ | GERUND
NP: (("a"|"the") (ADJ+[] SNP)) WHICH CLAUSE
CLAUSE: ADJECTIVE+[] ADVERB+[] SUBJECT VERB
| ADJECTIVE+[] ADVERB+[] SUBJECT VERB PREP
SUBJECT: NP
OBJECT: NP
WHICH: "that" | ""
GERUND: FOGHORN GER
FOGHORN: (PREPLY "-" NP)[] "-" OBJECT
ADJ: ADJ AND ADJ |
"purple" | "smelly" | "happy" | "windy" | "unbelievable" etc.
ADV: ADV AND ADV |
"quickly" | "slowly" | "happily" etc.
NOUN: "cat" | "dog" | "man" | etc.
VERB: ("berate" [OBJECT]) | ("stop" [OBJECT]) | ("flee" [OBJECT]) | ("put" OBJECT) LOCATION+ ...
GER: "berating" | "stopping" | "fleeing" | "putting" etc.
Plurals, tenses, passive constructions, subject-verb agreement, and a bunch of other things English has are not incorporated in this BNF, and neither are the required-argument rules; this is just the recursive structure of the phrases. The remaining stuff is complicated annoyance that doesn't touch the embedding structure. This embedding structure is what is more-or-less universal across all modern Old World languages.
Disclaimer: This is a draft skeleton, and I have not debugged it. I am sure it is full of nonsense, since it is late at night.
Unambiguous English
The main ambiguity problem with parsing English is that sometimes the "with a fork" attaches to nouns and sometimes to verbs. For example:
I ate the Chicken with one leg with a fork
might be parsed as
ate(subject=me, object=Chicken(the: with one leg, with a fork))
ate(subject=me, object=Chicken(the: with one leg), with a fork)
ate(subject=me(: with one leg), object=Chicken(the, singular), with a fork)
etc. depending on whether the eating was done with one leg, or with a fork, or whether these are attributes of myself, or of the chicken. This ambiguity makes it difficult to decide how to attach the ADVERB and ADJECTIVE arguments properly to what thing.
In order to eliminate this ambiguity, and make an unambiguously binding English-like language with the exact same word order, you can introduce the prepositions
with-ish of-ish on-ish to-ish
etc, to indicate that these attach to a nounphrase, not to a verb. I will write all ADJECTIVEs which modify the subject to the far left, and all adjectives which modify the various objects next to those objects. Further, there is the ambiguity of nesting level--- which depth of stack you are supposed to be on for each ADVERB and ADJECTIVE. I will indicate this using parentheses level. Then English becomes unambiguous. Further, these rules generally match the preferred style of the New York Times.
With-ish one leg, I ate (the Chicken) with a fork.
i.e. one-legged me ate the chicken using a fork.
I ate (the Chicken with-ish one leg) with a fork
i.e., I ate the one legged chicken using a fork
I ate (the Chicken with-ish one leg with-ish a fork)
i.e. I ate the one legged chicken which possesses a fork, which perhaps it uses as a prosthesis to replace its missing limb.
Unambiguous English might be a good target for starting semantics. It shouldn't require much to tell a computer that chickens can often have one leg, but more rarely have a fork. The semantic web construction from the nouns and the verbs and their attachments is not a hopeless project, but it is a difficult one, which requires ingenuity. The grammar problem is much easier, and a variation on the stuff above should solve it completely.
If by "English" you mean "any utterance that might be written or uttered by a native English speaker", then English — and indeed any language, for that matter — is probably far too complex for such a thing to exist. (And as JSBangs points out, there are reasons for believing that such a thing couldn't even exist in principle.)
If by "English" you're prepared to accept a highly constrained subset of English, then you might consider the rewrite rules that exist, for example, in early Chomskyan models of syntax. In symbolic form, these express things like "PP → P NP" ("a prepositional phrase is rewritten as/comprises a preposition plus noun phrase").
In reality, applications that need to do syntax parsing usually end up employing an algorithm that allows multiple candidate analyses to be assessed in parallel. There are several variants of this approach that essentially fall under the umbrella of "dynamic programming" — look in particular at something called the Viterbi Algorithm.
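As one concrete instance of the dynamic-programming family just mentioned, here is a minimal CYK recognizer (my own sketch, on an invented toy grammar; Viterbi-style parsing adds rule probabilities on top of the same table):

```python
# Minimal CYK recognizer for a grammar in Chomsky normal form.
# Toy grammar recognizing a^n b^n (n >= 1):
#   S -> A X | A B ;  X -> S B ;  A -> "a" ;  B -> "b"
BINARY = {("A", "X"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"X"}}
UNARY = {"a": {"A"}, "b": {"B"}}

def cyk(s):
    n = len(s)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving the substring s[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][1] = set(UNARY.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[i][split]:
                    for right in table[i + split][length - split]:
                        table[i][length] |= BINARY.get((left, right), set())
    return "S" in table[0][n]

print(cyk("aabb"), cyk("aab"))  # True False
```

The table holds all candidate analyses of every substring in parallel, which is exactly the "multiple candidate analyses assessed in parallel" idea: a probabilistic variant keeps the best-scoring analysis per cell instead of a set.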