How does Chrome know what language a page is in?
I just opened a web page in Google Chrome, and it says "This page is in Japanese, would you like to translate it?".
Asking for a translation would presumably send the contents to Google, but how is the language identified in the first place? Is this done locally, in the browser, or does this also send the page to Google? If so, should I not be asked for permission first?

The page itself has no markup to indicate the language, and it is an internal intranet page, so I am not at all sure that Google should have access to its content.
Solution 1:
The Chrome browser can identify, or at least guess, the page language by looking at a number of on-page factors:
- the HTTP headers (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields)
- the character encoding used
- the encoding meta tag
- a statistical analysis of the actual characters or words on the page
This can be done locally without any further internet connection or reporting to Google.
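As a toy illustration of the last factor above, a few lines of locally run code can already tell that a page is probably Japanese just by counting kana characters. This is not Chrome's algorithm, only a sketch of why no network access is required:

def looks_japanese(text, threshold=0.05):
    # Count characters in the hiragana (U+3040-U+309F) and katakana
    # (U+30A0-U+30FF) Unicode blocks, which only appear in Japanese text.
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    non_space = sum(1 for ch in text if not ch.isspace())
    return non_space > 0 and kana / non_space >= threshold

print(looks_japanese("これは日本語のページです"))   # True
print(looks_japanese("This page is in English."))  # False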
Translating the content, on the other hand, would definitely send the page content to Google's servers.
Solution 2:
The function is called DeterminePageLanguage. It's in the file components/translate/core/language_detection/language_detection_util.cc.

Chrome first checks the HTML lang attribute and, if it's not present, it checks the Content-Language HTTP header. Then it gets a prediction from cld3.
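Those two declared sources are trivial to read before any machine learning is involved. Here is a rough sketch of checking the lang attribute first and the Content-Language header second; it is not Chrome's code, the declared_language helper is just for illustration, and it uses only Python's standard library:

from html.parser import HTMLParser

class HtmlLangParser(HTMLParser):
    # Remembers the lang attribute of the first <html> tag encountered.
    def __init__(self):
        super().__init__()
        self.lang = ""

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self.lang:
            self.lang = dict(attrs).get("lang") or ""

def declared_language(html, headers):
    parser = HtmlLangParser()
    parser.feed(html)
    # The lang attribute takes precedence; fall back to the header.
    return parser.lang or headers.get("Content-Language", "")

print(declared_language('<html lang="ja"><body>...</body></html>',
                        {"Content-Language": "en"}))   # -> ja
print(declared_language('<html><body>...</body></html>',
                        {"Content-Language": "fr"}))   # -> fr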
The Compact Language Detector v3 (or CLD3) is a neural network model for language identification. The README states:
The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer.
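To make the fractions and the averaged embeddings in that quote concrete, here is a small sketch with toy sizes: a random table stands in for the trained embeddings and Python's built-in hash stands in for CLD3's hashing. It reproduces the banana example and builds one averaged vector:

import random
from collections import Counter

def ngram_fractions(text, n):
    # Fraction of times each character n-gram appears; for "banana" and n=3
    # the trigram "ana" occurs 2 times out of 4, i.e. 0.5.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}

def averaged_embedding(text, n, table, num_buckets, dim):
    # Hash each n-gram into a small id range and average the corresponding
    # embedding vectors, weighted by the fractions, as the README describes.
    vec = [0.0] * dim
    for gram, frac in ngram_fractions(text, n).items():
        row = table[hash(gram) % num_buckets]   # stand-in for CLD3's hashing
        vec = [v + frac * e for v, e in zip(vec, row)]
    return vec

random.seed(0)
NUM_BUCKETS, DIM = 16, 4
# Random table standing in for the embeddings estimated during training.
table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

print(ngram_fractions("banana", 3))  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
print(averaged_embedding("banana", 3, table, NUM_BUCKETS, DIM))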
So essentially, they downloaded copies of a bunch of websites and paid someone to look at the text on those websites and say what language they're written in. Then they split the text into n-grams (groups of n letters), computed how often each one appears, and used a neural network to learn a mapping between n-gram distributions and languages.
So now they have two variables:
- language, which is set from either the HTML or the header (recall that the HTML attribute takes precedence if both are present)
- cld_language, which is a prediction based on the frequencies of groups of letters on the page
Then we hit this series of if-statements (I've edited out the part where they send analytics data about language mismatches)
// No language was declared in the markup or the headers; go with CLD's guess.
if (language.empty()) {
  return cld_language;
}

// CLD couldn't identify the language; fall back to the declared value.
if (cld_language == kUnknownLanguageCode) {
  return language;
}

// CLD agrees but is more specific (e.g. it adds a regional sub-code);
// prefer its answer.
if (CanCLDComplementSubCode(language, cld_language)) {
  return cld_language;
}

// Declared and detected languages match (or are known to be similar);
// keep the declared code.
if (IsSameOrSimilarLanguages(language, cld_language)) {
  return language;
}

// The declaration looks like a common server misconfiguration; trust CLD.
if (MaybeServerWrongConfiguration(language, cld_language)) {
  return cld_language;
}

// Content-Language value might be wrong because CLD says that this page is
// written in another language with confidence. In this case, Chrome doesn't
// rely on any of the language codes, and gives up suggesting a translation.
return kUnknownLanguageCode;
CLD3 is small and is run locally. In fact, it's open source and they distribute a pre-trained model (although the code for training the model and the data they used isn't available). You can use it in your projects.
There are even Python bindings (unofficial and unmaintained) for the original C++ code; you'll need to install Cython first:
pip install cld3
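Assuming the installed bindings expose the get_language helper that the pycld3-style packages provide (the exact API may differ between forks), usage looks roughly like this:

import cld3  # the unofficial bindings installed above

# Returns a prediction with the language code, a probability and a
# reliability flag (field names as in the pycld3-style bindings).
prediction = cld3.get_language("これは日本語のページです")
print(prediction.language, prediction.probability, prediction.is_reliable)
# e.g.  ja 0.99 True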