How to download any(!) webpage with the correct charset in Python?

Solution 1:

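When you download a file with urllib or urllib2, you can check whether a charset was transmitted in the Content-Type header:

import urllib2
fp = urllib2.urlopen(request)  # request: a URL string or a urllib2.Request
charset = fp.headers.getparam('charset')  # None if no charset was sent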
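
On Python 3, where urllib2 no longer exists, the same lookup can be sketched with the standard library's header parsing. This is a minimal sketch, not part of the original answer; `charset_from_content_type` is a hypothetical helper name, and it works on a header value you already have, so it needs no network access:

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type header value, or None.

    Hypothetical helper for illustration; not part of urllib.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the name and returns None when absent.
    return msg.get_content_charset()
```

For a live response, `urllib.request.urlopen(url).headers.get_content_charset()` gives the same result, since the response headers object is a subclass of `email.message.Message`.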

You can use BeautifulSoup to locate a meta element in the HTML:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v is not None and v.lower() == 'content-type'})
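
If pulling in an HTML parser is not an option, the declared charset can also be picked out of the raw bytes. The following is a rough Python 3 sketch that swaps the BeautifulSoup lookup for a regular expression; `charset_from_meta` and `scan_bytes` are hypothetical names, and a regex is only an approximation of real HTML parsing:

```python
import re

# Matches charset=... inside a <meta ...> tag. Covers both
# <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def charset_from_meta(data, scan_bytes=4096):
    """Return the charset declared in a <meta> tag near the top of data, or None.

    data must be bytes; only the first scan_bytes bytes are searched.
    """
    match = _META_CHARSET.search(data[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None
```

Searching the undecoded bytes works because charset declarations are written in ASCII, which is why the lookup can run before the encoding is known.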

If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.

Solution 2:

Use the Universal Encoding Detector:

>>> import chardet, urllib2
>>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
{'encoding': 'GB2312', 'confidence': 0.99}

The other option would be to just use wget:

  import os
  h = os.popen('wget -q -O foo1.txt http://foo.html')
  h.close()
  s = open('foo1.txt').read()

Solution 3:

It seems like you need a hybrid of the answers presented:

1. Fetch the page using urllib.
2. Find <meta> tags using BeautifulSoup or another method.
3. If no meta tags exist, check the headers returned by urllib.
4. If that still doesn't give you an answer, use the Universal Encoding Detector.
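
The steps above can be sketched as a single function. This is a Python 3 illustration under stated assumptions, not a definitive implementation: `pick_charset` and its parameters are hypothetical names, the <meta> lookup uses a simple regex rather than BeautifulSoup, and the detector is passed in as a callable so that chardet stays optional:

```python
import re
from email.message import Message


def pick_charset(data, content_type=None, detector=None):
    """Try a <meta> tag, then the HTTP header, then a detector, in that order.

    All names are hypothetical. data is the raw page as bytes, content_type is
    the Content-Type header value (or None), and detector is any callable that
    takes bytes and returns an encoding name or None.
    """
    # 1. <meta> declaration in the first few KB of the document.
    meta = re.search(
        rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
        data[:4096],
        re.IGNORECASE,
    )
    if meta:
        return meta.group(1).decode("ascii").lower()

    # 2. charset parameter of the Content-Type response header.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        header_charset = msg.get_content_charset()
        if header_charset:
            return header_charset

    # 3. Statistical detection as a last resort.
    if detector is not None:
        return detector(data)
    return None
```

With chardet installed, the detector could be `lambda b: chardet.detect(b)["encoding"]`, and the result would be used as `data.decode(pick_charset(...) or "utf-8", errors="replace")`; the UTF-8 default there is an assumption, not something the answers prescribe.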

I honestly don't believe you're going to find anything better than that.

In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.

If you believe the FAQ, this is what browsers do (as requested in your original question), since the detector is a port of the Firefox sniffing code.
