Python and BeautifulSoup encoding issues
Question
I'm writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site: http://www.elnorte.ec/
I'm getting the contents with the requests library:
import requests

r = requests.get('http://www.elnorte.ec/')
content = r.content  # raw response bytes, not decoded text
If I print the content variable at that point, all the Spanish special characters seem to come out fine. However, once I feed the content variable to BeautifulSoup, it all gets messed up:
soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&month=08&day=27&modid=203" title="1009 artÃculos en este dÃa">
...
It's apparently garbling all the Spanish special characters (accents and whatnot). I've tried content.decode('utf-8') and content.decode('latin-1'), and I've also tried the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.
Any pointers would be much appreciated.
Recommended answer
Could you try:
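import urllib
import BeautifulSoup  # BeautifulSoup 3 under Python 2

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())  # call read() so the page bytes are passed in
r.close()
print x.prettify('latin-1')  # serialize the parsed tree as Latin-1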
I get the correct output. Oh, in this special case you could also use x.__str__(encoding='latin1').
I guess this is because the content is in ISO-8859-1 (or ISO-8859-15) and the meta http-equiv content-type incorrectly says "UTF-8".
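To make the effect of such a mismatch concrete, here is a small Python 3 sketch using only built-in string methods. The sample string is made up for illustration (it mirrors the title attribute in the question), and nothing here is fetched from the actual site. It shows both directions of the mismatch: UTF-8 bytes read as Latin-1, and Latin-1 bytes read as UTF-8.

```python
# Hypothetical sample text, modelled on the title attribute in the question.
text = "artículos en este día"

# Case 1: UTF-8 bytes decoded as Latin-1.
# Latin-1 maps every byte to a character, so this never raises, but each
# accented letter becomes two characters ("í" turns into "Ã" plus an
# invisible soft hyphen).
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("latin-1").decode("utf-8")

# Case 2: Latin-1 bytes decoded as UTF-8.
# A strict decode would raise UnicodeDecodeError; a lenient decode replaces
# each invalid byte with U+FFFD, which loses the original letter.
latin1_bytes = text.encode("latin-1")
lossy = latin1_bytes.decode("utf-8", errors="replace")

print(repr(garbled))
print(repr(repaired))
print(repr(lossy))
```

If the garbled output of a page looks like case 1, the bytes were probably UTF-8 that got treated as Latin-1 somewhere; if it looks like case 2, the reverse is more likely.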
Could you confirm that?
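One rough way to check a guess like this is to test which codec can decode the raw bytes without errors. The following Python 3 sketch assumes you already have the page as bytes (for example r.content from requests). The helper name first_valid_encoding and the sample byte strings are hypothetical, introduced only for illustration.

```python
def first_valid_encoding(raw, candidates=("utf-8", "latin-1")):
    """Return the first candidate codec that decodes raw without errors.

    Latin-1 accepts any byte sequence, so it should come last in the list;
    a result of "latin-1" only means the stricter candidates were ruled out.
    """
    for name in candidates:
        try:
            raw.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample bytes standing in for a downloaded page.
utf8_page = "artículos en este día".encode("utf-8")
latin1_page = "artículos en este día".encode("latin-1")

print(first_valid_encoding(utf8_page))
print(first_valid_encoding(latin1_page))
```

This is only a heuristic: bytes that decode cleanly as UTF-8 are very likely UTF-8, but a fallback to Latin-1 does not prove the page was written in that encoding.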