HTML scraping using lxml and requests gives a unicode error

Views: 188 Published: 2018/6/14 19:14:46 Tags: python html unicode web-scraping lxml

Problem description

I'm trying to use an HTML scraper like the one provided here. It works fine for the example they provided. However, when I try it with my webpage, I receive this error:

Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

I've tried googling but couldn't find a solution. I'd truly appreciate any help. I'd like to know if there's a way to copy it as HTML using Python.

Edit:

    from lxml import html
    import requests

    page = requests.get('http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=PTEN&ln1=PTEN&start=130&end=140&coords=bp%3AAA&sn=&ss=&hn=&sh=&id=15#')
    tree = html.fromstring(page.text)

Thank you.

Solution

Short answer: use page.content, not page.text.

From http://lxml.de/parsing.html#python-unicode-strings :

the parsers in lxml.etree can handle unicode strings straight away ... This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding
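
The rule quoted above can be tried directly with lxml.etree. This is a minimal sketch, assuming only that lxml is installed; the <root> documents are made-up inputs, not taken from the page in the question, and the exact behaviour may differ between lxml versions:

```python
from lxml import etree

# A unicode string with no encoding declaration parses directly.
plain = etree.fromstring('<root>caf\u00e9</root>')

# The same kind of string, but with an encoding declaration on top
# (a made-up document for illustration).
declared = '<?xml version="1.0" encoding="utf-8"?><root>caf\u00e9</root>'
try:
    etree.fromstring(declared)
    message = None
except ValueError as exc:
    # Expected with the lxml behaviour described in the question: the text
    # is already decoded, so a declaration naming a byte encoding is rejected.
    message = str(exc)

# Encoding the string back to the declared encoding gives lxml bytes,
# which it accepts together with the declaration.
from_bytes = etree.fromstring(declared.encode('utf-8'))

print(plain.text, from_bytes.text)
print(message)
```

If the ValueError is raised, message holds the same text the question reports.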

From http://docs.python-requests.org/en/latest/user/quickstart/#response-content :

Requests will automatically decode content from the server [as r.text]. ... You can also access the response body as bytes [as r.content].

So you see, both requests (through page.text) and lxml.etree want to decode the utf-8 bytes to unicode. But if we let page.text do the decoding, then the encoding declaration inside the document becomes a lie: it still names a byte encoding, while the string lxml receives is already decoded.

So, let's use page.content, which does no decoding. That way lxml receives the raw, undecoded bytes and does the decoding itself.
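
As a rough offline sketch of the fix, the snippet below feeds lxml bytes instead of text. The content variable is a hypothetical stand-in for page.content (the real page is not fetched here), and its markup is invented for illustration:

```python
from lxml import html

# Hypothetical stand-in for page.content: the raw bytes of a page that
# starts with an XML declaration, as the page in the question presumably does.
content = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<html><head><title>PTEN analysis</title></head>'
    b'<body><p>Gene: PTEN</p></body></html>'
)

# Passing bytes leaves all decoding to lxml, so the declaration is no longer
# at odds with the type of the input.
tree = html.fromstring(content)

title = tree.findtext('.//title')
paragraphs = tree.xpath('//p/text()')

print(title)
print(paragraphs)
```

With a live response, the corresponding change to the code in the question is tree = html.fromstring(page.content).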
