Solution 1:

ABSTRACT:

            First, I will explore several corpora in several ways, including original research. I will also provide a cautionary tale of how to misuse and abuse Google N-Grams to produce a pretty picture that seems to prove whatever point you care to make. Finally, I will offer an innovative solution to the current quandary, one that is guaranteed to offend no one.

TL;DR: Avoid the issue altogether and use dataset and datapoint as needed.


In the Hallowed Halls of Academe

In the sciences, meaning in research articles published in refereed journals, data is almost always construed to be plural, and datum reserved for the singular. As with all specialized, technical vocabulary, this identifies the writer as a member of that particular in-group, and not an outsider with no appreciation for the topic. You see this everywhere from biomedicine to linguistics.

In surveying two different corpora of biomedical research articles, I found the ratio of plural uses of data to singular uses to be extremely high. I examined the PubMed Central Open Access (PMCOA) collection, and I also examined Elsevier. Here were my criteria:

  • I counted as plural uses of data when immediately preceded by the plural determiner these, so “these data”, as well as when data was immediately followed by any of have, were, are, or do.

  • I counted as singular uses of data when immediately preceded by the singular determiner this, so “this data”, as well as when data was immediately followed by any of has, was, is, or does.

I did not attempt to detect the paired determiners those data versus that data, because it is difficult to distinguish that used as a determiner from that used as a relative pronoun. I also did not look for contractions: contractions are informal, and so is using data in the singular, so I worried that counting them might bias the results.
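The counting criteria above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the survey: the pattern names and the function `count_data_agreement` are hypothetical, and it assumes that a phrase such as “these data are” counts as one hit rather than two.

```python
import re

# Plural: "these data", or "data" followed by have / were / are / do.
PLURAL = re.compile(
    r"\bthese\s+data\b|\bdata\s+(?:have|were|are|do)\b", re.IGNORECASE
)
# Singular: "this data", or "data" followed by has / was / is / does.
SINGULAR = re.compile(
    r"\bthis\s+data\b|\bdata\s+(?:has|was|is|does)\b", re.IGNORECASE
)


def count_data_agreement(text):
    """Return (plural_hits, singular_hits) for one document's text.

    The trailing \\b keeps contractions such as "isn't" from matching,
    and the leading \\b keeps "metadata" from matching.
    """
    return len(PLURAL.findall(text)), len(SINGULAR.findall(text))


plural_hits, singular_hits = count_data_agreement(
    "These data are clear, but this data is old."
)
print(plural_hits, singular_hits)
```

Summing these pairs over every article in a journal would give per-journal plural and singular totals of the kind tabulated in this answer.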

Here were my results against these two corpora:

  • For the large Elsevier corpus of around a million journal articles, I found the ratio of plural uses to singular ones to be in the neighborhood of 9:1 in favor of the plural.

  • For the smaller PMCOA corpus of around 150,000 journal articles examined, this ratio was lower but still clearly positive at around 4:1 in favor of the plural.

Why Elsevier is higher than PMCOA is an interesting question, but not really germane to the ultimate advice I intend to give. Curiously, this ratio varies quite considerably by journal, suggesting that the editors of particular publications may enforce an editorial policy here. For example, just looking at journals from PMCOA:

Diabetes                  files   433 total 1633 pl 1609 sg   24 ratio 67.0 : 1
J_Exp_Med                 files   431 total 2348 pl 2309 sg   39 ratio 59.2 : 1
Diabetes_Care             files   547 total 1148 pl 1122 sg   26 ratio 43.2 : 1
J_Cell_Biol               files   627 total 1808 pl 1763 sg   45 ratio 39.2 : 1
J_Biol_Chem               files   408 total 1544 pl 1502 sg   42 ratio 35.8 : 1
J_Exp_Bot                 files   489 total 1063 pl 1016 sg   47 ratio 21.6 : 1
Arthritis_Res_Ther        files  2102 total 2992 pl 2842 sg  150 ratio 18.9 : 1
Emerg_Infect_Dis          files  2922 total 1366 pl 1287 sg   79 ratio 16.3 : 1
Breast_Cancer_Res         files  1354 total 1958 pl 1805 sg  153 ratio 11.8 : 1
Crit_Care                 files  2438 total 3482 pl 3202 sg  280 ratio 11.4 : 1
Environ_Health_Perspect   files 12723 total 5068 pl 4639 sg  429 ratio 10.8 : 1
PLoS_Biol                 files  2476 total 3976 pl 3614 sg  362 ratio 10.0 : 1

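In contrast, only one journal with more than a thousand hits, BMC_Bioinformatics, had at least as many singular as plural mentions. The other journals where the singular matched or outnumbered the plural all had far fewer hits: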

BMC_Bioinformatics        files 3588 total 9413 pl 4609 sg 4804 ratio    1 : 1.0
Cancer_Inform             files  154 total  480 pl  231 sg  249 ratio    1 : 1.1
Algorithms_Mol_Biol       files   98 total  200 pl   84 sg  116 ratio    1 : 1.4
Front_Neuroinformatics    files   66 total  169 pl   83 sg   86 ratio    1 : 1.0
Indian_J_Community_Med    files  243 total  159 pl   73 sg   86 ratio    1 : 1.2
Indian_J_Pharm_Sci        files  299 total  110 pl   53 sg   57 ratio    1 : 1.1
Bioinform_Biol_Insights   files   60 total  103 pl   51 sg   52 ratio    1 : 1.0
Int_J_Telemed_Appl        files   40 total  102 pl   42 sg   60 ratio    1 : 1.4
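For anyone who wants to recompute the ratio column from the plural and singular counts, here is a minimal sketch. It assumes rows laid out exactly as in the tables above; `parse_row` and `format_ratio` are hypothetical helper names, not part of any tool used for the survey.

```python
import re

# One table row: journal name, then labelled integer columns.
ROW = re.compile(
    r"^(\S+)\s+files\s+(\d+)\s+total\s+(\d+)\s+pl\s+(\d+)\s+sg\s+(\d+)"
)


def parse_row(line):
    """Return (journal, files, total, plural, singular), or None if no match."""
    match = ROW.match(line)
    if match is None:
        return None
    journal = match.group(1)
    files, total, plural, singular = map(int, match.groups()[1:])
    return journal, files, total, plural, singular


def format_ratio(plural, singular):
    """Format plural:singular with the smaller count normalised to 1."""
    if plural == 0 or singular == 0:
        return "n/a"
    if plural >= singular:
        return f"{plural / singular:.1f} : 1"
    return f"1 : {singular / plural:.1f}"


rows = [
    "Diabetes                  files   433 total 1633 pl 1609 sg   24",
    "BMC_Bioinformatics        files 3588 total 9413 pl 4609 sg 4804",
]
for row in rows:
    journal, files, total, plural, singular = parse_row(row)
    # In these tables the total is simply plural + singular.
    assert total == plural + singular
    print(f"{journal:<25} {format_ratio(plural, singular)}")
```

Putting the smaller count on the "1" side is what makes the second table read as 1 : x rather than x : 1.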

If you are publishing a research article for the scientific community, you should clearly use data as a plural, which also opens up datum as a singular if you have need of it.

Outside academic and scientific arenas, however, the question of which to choose is increasingly less obvious.


In English Books

When it comes to general books published in English, the preference for data in the plural still exists, but is nowhere near so strong as it is in scientific work. In all these plots, we see the following things:

  • There was next to no use of data as a singular during the 19th century.
  • There is a peak of overall use of data around 1980.
  • The use of data as a singular has substantially increased in the last few decades, particularly when considered as a ratio against its use as a plural.
  • Data as a plural still wins.

Here are the general English plots. The first of them is where the ratio is tightest.

  • Google N-Gram of the English corpus plotting plural data are in blue against singular data is in red:


  • Google N-Gram of the English corpus plotting plural data were in blue against singular data was in red


  • Google N-Gram of the English corpus plotting plural data do in blue against singular data does in red:


  • Google N-Gram of the English corpus plotting plural these data in blue against singular this data in red



Lies, Damned Lies, and Google N-Grams

Those were the general results in all books published in English. What about other corpora? If you run the same queries on just American English and on just British English, there is no significant difference. Choosing just English Fiction does not seem to alter the results either.

What does seem to make a difference is if you run the data out into the 21st century, especially in the English Fiction corpus, which may be more indicative of common speech than formal English.

First the plot starting from 1600 through 2008, the last year for which data are available:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1600–2008:


Now just from 1800–2008:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1800–2008:


Now just from 1900–2008:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1900–2008:


And finally, here is the plot for the last 40 years for which data are currently available:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1968–2008


As you see, the range of years chosen changes the slope of the graph significantly. But that’s not all! All these have used a smoothing of 3. It turns out the smoothing factor selected makes a huge difference. Here again is the 1800–2008 data with smoothing set to 0:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1800–2008, with smoothing set to 0:


But watch what happens when the smoothing is set to 20:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1800–2008, with smoothing set to 20:


See how different that is? Two lines that seemed to almost cross are now far apart. You’re probably wondering about the last 40 years without smoothing, instead of with 3 as I presented them above. Here then are the last 40 years of data once again, but this time with smoothing set to 0:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1968–2008, with smoothing set to 0:


Whereas here it is with smoothing at 10:

  • Google N-Gram of the English Fiction corpus plotting plural data are in blue against singular data is in red, covering years 1968–2008, with smoothing set to 10:


Isn’t that completely wild?!

What’s really the moral of the story here? That you must be exceedingly careful with Google N-Grams. Quite apart from the extremely serious matter of whether the correct, relevant, and appropriate n-grams were chosen in the first place, and which corpus they are run against, it is incredibly easy to tweak the graphs. By changing the range, the cut-off date, and the smoothing, you can alter the overall picture so much that one plot contradicts another drawn from what are essentially the same data.

Never trust just a picture. Make sure you get the full link as I have done here, both so you know the actual values used, and so that you can check things out for yourself.
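To see why the smoothing setting matters so much, here is a minimal sketch. It assumes that a smoothing of n replaces each year's value with the average of that year and up to n years on either side, which is how Google describes the setting; how the edges of the range are handled here is my own guess. The two series are made-up numbers chosen purely for illustration, not real n-gram frequencies.

```python
def smooth(values, smoothing):
    """Centered moving average over up to `smoothing` neighbours per side.

    Near the edges the window is simply truncated (an assumption).
    A smoothing of 0 returns the raw values unchanged.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def count_crossings(a, b):
    """Count how many times series a and b swap which one is on top."""
    diffs = [x - y for x, y in zip(a, b)]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)


# Hypothetical yearly values, NOT taken from any real corpus.
plural = [5, 9, 4, 8, 3, 7, 4, 6]
singular = [1, 2, 6, 2, 7, 3, 8, 5]

for n in (0, 3):
    crossings = count_crossings(smooth(plural, n), smooth(singular, n))
    print(f"smoothing={n}: the two lines cross {crossings} time(s)")
```

With these made-up numbers the raw lines swap places repeatedly, while the smoothed lines cross far less often, so the same underlying values tell two quite different visual stories.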


And now for something completely different!

The charts that extend the plot into the 3rd millennium are especially interesting, because this suggests that in the space of a generation, things have swung around from plural dominating the singular by a ratio greater than 2:1 to the present day in which the two are nearly equal.

If that is accurate, it means half the people will use one form, and half the people the other. It is almost as though one is guaranteed to annoy half the people all the time.

I therefore suggest not using data as a noun at all. Oh, you can retain it as an attributive noun, as in data processing or data type, since there is no question of numeric agreement there. But as soon as you try to make it agree, you are going to bother half the people. And you don’t want to do that.

One problem with using data in the singular is that there is now no reasonable plural, since *datas not only sounds abominable, it is completely meaningless given the use of data as a mass noun instead of a count noun.

I propose as the solution to all this that you should use data set as a collection of data points.

That is, use data point where scientists still occasionally use datum, and use data set (or more succinctly, dataset) when you mean a collection of these individual points. Moreover, you can now use datasets as a collection of collections.

  • This dataset is from ten years ago.
  • All datasets are to be gathered by the field research team.
  • Surely this particular datapoint is completely spurious.
  • I can’t see how to fit these three outlying datapoints to the curve.

This way you will offend no one: neither yourself nor your audience, no matter how erudite or pedestrian they may be.