Are dictionaries using 'big data'?

The short answer is yes, dictionaries do use corpora and electronic searches on those corpora to identify words and phrases, as well as their grammatical categories and semantic relationships.

Here is a quote from Macmillan:

"Using intelligent software... we can find every example in the corpus of a particular word, phrase, grammatical pattern, or collocation. It is this information which forms the basis for everything we say about words in the dictionary."

Macmillan goes on to describe how their software not only finds every occurrence of a word and its variations, but also outputs one-page summaries of the word's important grammatical and semantic relationships. They give an example:

"The program first collects all the examples of the word being investigated.... Then it applies a second stage of analysis. This time, the software looks at particular grammatical relationships. In the case of evidence, it finds all the sentences where evidence is the object of a verb, then identifies the most frequent verbs used in this pattern.... [P]eople often talk (or write) about giving evidence, finding evidence, presenting evidence, or gathering evidence. Similarly, the... [software outputs] a list of the adjectives that most frequently modify this noun: we may say there is little evidence for something, or talk about clear evidence, strong evidence, or scientific evidence."

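To make the two-stage analysis in that quote concrete, here is a minimal sketch in Python. It is not Macmillan's software: the tiny hand-tagged corpus, the tag names, and the `collocates` function are all assumptions made for illustration. It also cuts two corners that a real system would not: it treats the nearest verb to the left of the noun as the verb taking it as object (a real system would use a parser), and it does not lemmatize, so "found" and "find" would be counted separately.

```python
from collections import Counter

# Toy, hand-tagged corpus (hypothetical data): each sentence is a list of (word, tag) pairs.
CORPUS = [
    [("police", "NOUN"), ("found", "VERB"), ("clear", "ADJ"), ("evidence", "NOUN")],
    [("she", "PRON"), ("gave", "VERB"), ("evidence", "NOUN")],
    [("they", "PRON"), ("found", "VERB"), ("strong", "ADJ"), ("evidence", "NOUN")],
    [("he", "PRON"), ("presented", "VERB"), ("the", "DET"), ("clear", "ADJ"), ("evidence", "NOUN")],
]


def collocates(corpus, target):
    """Count adjectives modifying `target` and the nearest verb to its left."""
    adjectives, verbs = Counter(), Counter()
    for sentence in corpus:
        for i, (word, tag) in enumerate(sentence):
            # Stage 1: collect every occurrence of the target noun.
            if word != target or tag != "NOUN":
                continue
            # Stage 2: walk left over the noun's modifiers (adjectives, determiners).
            j = i - 1
            while j >= 0 and sentence[j][1] in ("ADJ", "DET"):
                if sentence[j][1] == "ADJ":
                    adjectives[sentence[j][0]] += 1
                j -= 1
            # Simplification: the first verb to the left is assumed to take the noun as its object.
            if j >= 0 and sentence[j][1] == "VERB":
                verbs[sentence[j][0]] += 1
    return adjectives, verbs


adjectives, verbs = collocates(CORPUS, "evidence")
print(verbs.most_common(1))       # [('found', 2)]
print(adjectives.most_common(1))  # [('clear', 2)]
```
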
They also use the corpora to generate the labels for certain words, such as archaic, informal, American, or journalism. The labels are assigned according to the kinds of documents in the corpora (for example, old texts or audio recordings) in which a word primarily appears. Here is their example:

"When we look at all the examples of eatery in the corpus we find that a majority come from newspapers and magazines, and most of these newspapers and magazines are from the U.S. So in the dictionary, the word eatery has two ‘labels’: mainly american and mainly journalism."

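The labelling step in that quote can be sketched the same way. Again, this is only an illustration under stated assumptions: the `HITS` records, the `genre` and `region` fields, the `suggest_labels` function, and the 50% cutoff are all hypothetical, and the sketch computes the two shares independently rather than looking at the region of the journalism hits specifically.

```python
# Hypothetical corpus hits for one word: each hit records its source's genre and region.
HITS = [
    {"genre": "newspaper", "region": "US"},
    {"genre": "magazine", "region": "US"},
    {"genre": "newspaper", "region": "US"},
    {"genre": "fiction", "region": "UK"},
    {"genre": "newspaper", "region": "UK"},
]

# Assumption: newspapers and magazines together count as "journalism".
JOURNALISM_GENRES = {"newspaper", "magazine"}


def suggest_labels(hits, threshold=0.5):
    """Return labels whose share of hits is strictly above `threshold` (an assumed cutoff)."""
    total = len(hits)
    if total == 0:
        return []
    american = sum(1 for h in hits if h["region"] == "US")
    journalism = sum(1 for h in hits if h["genre"] in JOURNALISM_GENRES)
    labels = []
    if american / total > threshold:
        labels.append("mainly American")
    if journalism / total > threshold:
        labels.append("mainly journalism")
    return labels


print(suggest_labels(HITS))  # ['mainly American', 'mainly journalism']
```
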
One thing Macmillan does not mention, however, is finding synonyms for novel words. They suggest that the primary data for new words comes from citations. The examples they give are (1) using green as a transitive verb to mean "to make something more environmentally friendly" and (2) using handbags as an adjective. But still, if these uses occur in the corpus, they are obviously subject to the same kind of analysis as other words.

These quotes all come from Macmillan, but other major dictionaries such as Oxford are almost certainly doing the same thing. If you search around, you are likely to find descriptions of their processes.