Why is 'data' treated as a plural more often at the start of a sentence than within it?
I recently put together some Google n-grams for a short piece on the transition of the word data into a singular (mass) noun: Data are Beautiful: Data's story in grammar.
There was one peculiar finding:
At the start of a sentence, the trend reverses. "The data are" and "Data are" are approximately twice as common as "The data is" and "Data is". (n-gram)
The best suggestion so far is that writers are more careful at the beginning of a sentence (reddit comments).
Are there any actual grammatical reasons? Or, do you have any other guesses as to why the trend reverses?
If so, I'll add them to the story and link back here.
Meta: I'm not asking about the difference between "data is" and "data are", so this question is not a duplicate of the earlier ones on ELU.
Here is an Ngram of sentence-beginning "The data is" (yellow) and "The data are" (green) versus later-in-the-sentence "the data is" (red) and "the data are" (blue):
As you note in your article, the sentence-beginning versions of these phrases show less of an inclination toward "data is" than the later-in-sentence versions do.
But consider this Ngram of sentence-beginning "The data shows" (yellow) and "The data show" (green) versus later-in-the-sentence "the data shows" (red) and "the data show" (blue):
Here the greater preference for the plural form at the beginning of a sentence versus elsewhere in the sentence is again evident, but more striking is the preference for the plural over the singular regardless of where the phrase falls in a sentence. Another interesting feature of this Ngram is that, for most of the years reported, sentence-beginning "The data show/shows" is slightly more common than later-in-the-sentence "the data show/shows"—a much different result than with "The data are/is" versus "the data are/is." I have no idea why this is so.
These results suggest that factors other than position in a sentence can have a powerful effect on the popularity of plural versus singular forms of data. In view of that, I would be hesitant to reach a broad conclusion about the overall impact of position in a sentence in such preferences.
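One way to put the two positions on a common footing is to reduce each pair of series to a single "plural share" (plural count divided by plural plus singular) and compare those shares. The following is only a minimal sketch in Python: `plural_share` is a hypothetical helper, and the numbers are made-up placeholders, not actual Ngram Viewer output. It relies on the fact that Ngram Viewer searches are case-sensitive, so "The data are" and "the data are" can be counted separately.

```python
def plural_share(plural_count: float, singular_count: float) -> float:
    """Return the fraction of occurrences that take the plural verb form."""
    total = plural_count + singular_count
    if total == 0:
        raise ValueError("no occurrences to compare")
    return plural_count / total


# Placeholder frequencies for illustration only (NOT real Ngram data).
# Each entry is (plural count, singular count) for one position.
placeholder_counts = {
    "sentence-initial": (2.0, 1.0),  # "The data are" vs. "The data is"
    "mid-sentence": (1.0, 2.0),      # "the data are" vs. "the data is"
}

shares = {
    position: plural_share(plural, singular)
    for position, (plural, singular) in placeholder_counts.items()
}

for position, share in shares.items():
    print(f"{position}: plural share = {share:.2f}")
```

Running the same calculation for "show/shows" as for "are/is" would show whether the gap between positions is similar across verbs or depends on the verb.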
A final caution involves Ngram Viewer charts in general: They are unreliable in various ways, starting with the OCR program's not infrequent misreading of publication dates and search strings, and the search results feature's variation in reported results depending on the time frame selected. They are pretty to look at, though.