How to deal with unknown encoding when scraping webpages?


Problem Description




I'm scraping news articles from various sites, using GAE and Python.

The code where I scrape one article url at a time leads to the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
    

Here's my code in its simplest form:

    from google.appengine.api import urlfetch
    
    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            return result.content
    

Here is another variant I have tried, with the same result:

    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            s = result.content
            s = s.decode('utf-8')
            s = s.encode('utf-8')
            s = unicode(s,'utf-8')
            return s
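
Even when the first decode succeeds, this round trip is a no-op: decode, encode, and unicode() together are equivalent to a single s.decode('utf-8'). And that decode raises as soon as a page isn't actually UTF-8, as this one-line illustration (mine, using nothing beyond stock Python 2) shows:

    '\xe9t\xe9'.decode('utf-8')  # UnicodeDecodeError: these are the latin-1 bytes
                                 # of 'été', not a valid UTF-8 sequence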
    

Here's the ugly, brittle one, which also doesn't work:

    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            s = result.content
    
            try:
                s = s.decode('iso-8859-1')
            except:
                pass
            try:
                s = s.decode('ascii')
            except: 
                pass
            try:
                s = s.decode('GB2312')
            except:
                pass
            try:
                s = s.decode('Windows-1251')
            except:
                pass
            try:
                s = s.decode('Windows-1252')
            except:
                s = "did not work"
    
            s = s.encode('utf-8')
            s = unicode(s,'utf-8')
            return s
    

The last variant returns s as the string "did not work" from the last except.
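
Worth noting why this cascade can never work: iso-8859-1 assigns a character to every one of the 256 byte values, so the very first decode always "succeeds" and s becomes a (possibly garbled) unicode object. Every later .decode() call on it then implicitly encodes with the ASCII codec, raises, and is swallowed by a bare except, until the last one assigns "did not work". A minimal Python 2 demonstration of that failure mode:

    s = '\xe2\x80\x99'          # the UTF-8 bytes of a right single quote (U+2019)
    u = s.decode('iso-8859-1')  # never raises: latin-1 maps all 256 byte values,
                                # yielding mojibake u'\xe2\x80\x99' instead of an error
    u.decode('ascii')           # decoding a unicode object first ASCII-encodes it,
                                # raising UnicodeEncodeError -- swallowed by the bare excepts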

So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to fetch the entire html rather than parse it with BeautifulSoup right away? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

Solution

I had the same problem some time ago and there is nothing 100% accurate. What I did was (see the sketch after this list):

• Get encoding from Content-Type
• Get encoding from meta tags
• Detect encoding with chardet Python module
• Decode text from the most common encoding to Unicode
• Process the text/html
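
A minimal sketch of that cascade, written for the question's GAE/Python 2 setup. This is my illustration, not the answerer's actual code, and it assumes the third-party chardet module is installed:

    import re
    import chardet                       # third-party detector, assumed installed
    from google.appengine.api import urlfetch

    def guess_encoding(result):
        # 1. charset parameter of the Content-Type response header
        #    (header-key casing normalized, since it may vary by server)
        headers = dict((k.lower(), v) for k, v in result.headers.items())
        m = re.search(r'charset=([\w-]+)', headers.get('content-type', ''), re.I)
        if m:
            return m.group(1)
        # 2. <meta ... charset=...> declaration in the raw bytes
        m = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', result.content, re.I)
        if m:
            return m.group(1)
        # 3. statistical guess; chardet may return None on short or ambiguous input
        guess = chardet.detect(result.content)
        # 4. fall back to the most common encoding on the web
        return guess['encoding'] or 'utf-8'

    def fetch(url):
        result = urlfetch.fetch(url, headers={'User-Agent': 'Chrome/11.0.696.16'})
        if result.status_code == 200:
            # 'replace' substitutes U+FFFD for stray bad bytes instead of raising,
            # so one malformed byte cannot kill the whole page
            return result.content.decode(guess_encoding(result), 'replace')

A page can still declare a charset Python has no codec for, which makes decode raise LookupError, so a production version would validate the name with codecs.lookup() before trusting it and fall back to chardet's guess.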
