TypeError: 'str' does not support the buffer interface in html2text


Question

I'm using Python 3 to do some web scraping. I want to save a web page and convert it to text using the following code:

import urllib.request
import html2text
url='http://www.google.com'
page = urllib.request.urlopen(url)
html_content = page.read()
rendered_content = html2text.html2text(html_content)

But when I run the code, it raises a TypeError:

  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/html2text-2016.4.2-py3.4.egg/html2text/__init__.py", line 127, in feed
    data = data.replace("</' + 'script>", "</ignore>")
TypeError: 'str' does not support the buffer interface

Could anyone tell me how to deal with this error? Thank you in advance!

Recommended answer

I took the time to investigate this, and it turns out to be easily resolved.

The problem is one of bad input: page.read() returns a byte string (bytes) rather than a regular string (str).

Byte strings are how Python represents data whose character encoding it doesn't know. A str in Python 3 is a sequence of Unicode characters, whereas what comes over the network is just raw bytes that could be in any encoding.

Because Python doesn't know which encoding was used, it leaves the data as raw bytes - this is how all data is represented internally anyway - and lets the programmer decide which encoding to decode it with.

Regular string operations mixed with these byte strings - such as the replace() call with str arguments that html2text makes - fail, because bytes methods accept only bytes arguments, not str.
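To make the failure concrete, here is a minimal sketch that reproduces it without any network access. The sample bytes are made up for illustration; the exact wording of the TypeError message differs between Python versions.

```python
# Made-up sample: a small HTML fragment encoded as ISO-8859-1 bytes.
html_bytes = b"<p>caf\xe9</p><script>x</script>"

# bytes.replace() exists, but it rejects str arguments.
try:
    html_bytes.replace("</script>", "</ignore>")
    mixed_call_failed = False
except TypeError as exc:
    mixed_call_failed = True
    print("TypeError:", exc)

# The same call works when every argument is bytes.
replaced_bytes = html_bytes.replace(b"</script>", b"</ignore>")

# After decoding, str.replace() with str arguments works,
# which is the kind of call html2text makes on its input.
html_text = html_bytes.decode("iso-8859-1")
replaced_text = html_text.replace("</script>", "</ignore>")
```

So the fix is to decode the bytes before handing them to html2text, not to change the library call.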

html_content = page.read().decode('iso-8859-1')

Padraic Cunningham's solution in the comments is correct in its essence: you first have to tell Python which character encoding to use, so that it can map these bytes to the correct characters.

Unfortunately, this particular page isn't encoded as UTF-8, so asking Python to decode it as UTF-8 raises an error.
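A small sketch of why the wrong encoding fails, using a made-up sample string in place of the real page: a byte that is valid in ISO-8859-1 is not necessarily valid UTF-8.

```python
# "café" encoded as ISO-8859-1; the "é" becomes the single byte 0xE9.
raw = "café".encode("iso-8859-1")

# 0xE9 on its own is not a complete UTF-8 sequence, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("UnicodeDecodeError:", exc)

# Decoding with the encoding the bytes were actually written in succeeds.
latin1_text = raw.decode("iso-8859-1")
```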

The correct encoding is usually declared in the response headers themselves, in the charset parameter of the Content-Type header. Content-Type is a standard HTTP header that most servers send, although neither the header nor its charset parameter is guaranteed to be present.

Calling page.info().get_content_charset() returns that charset, which in this case is iso-8859-1 (it returns None if the header declares no charset). From there, you can decode the bytes using iso-8859-1, so that regular string tools can operate on the result normally.

charset_encoding = page.info().get_content_charset()
html_content = page.read().decode(charset_encoding)
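Since the charset may be missing, a slightly more defensive variant could fall back to a default encoding. The sketch below is only an illustration: decode_body and its fallback parameter are hypothetical names, not part of urllib or html2text, and an email.message.Message object stands in for page.info() (the object urllib returns is a subclass of it) so the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw response bytes using the charset declared in the headers.

    Hypothetical helper: falls back to `fallback` when no charset is
    declared, and replaces undecodable bytes instead of raising.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Stand-in for page.info(): headers that declare a charset.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
text = decode_body(b"caf\xe9", headers)

# Headers without a charset parameter: the fallback encoding is used.
bare = Message()
bare["Content-Type"] = "text/html"
fallback_text = decode_body(b"caf\xc3\xa9", bare)
```

With a real response, the call would look like decode_body(page.read(), page.info()), and the result can then be passed to html2text.html2text(). Whether UTF-8 is the right fallback depends on the sites being scraped.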
