BeautifulSoup findAll with class attribute: unicode encode error
Question
I am using BeautifulSoup to extract news stories (just the titles) from Hacker News, and this is what I have so far:
import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    # Fetch the Hacker News front page
    page_html = urllib2.urlopen(HN_url)
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html = []
    # Collect every link found inside a <td class="title"> cell
    for td in soup.findAll("td", {"class": "title"}):
        titles_html += td.findAll("a")
    return titles_html

print get_stories(get_page())
When I run the code, however, it gives an error:
Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)
How do I get this to work?
Recommended answer
This happens because BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python convert it to its default encoding, which is usually ASCII. That conversion will generally fail for web sites that contain non-ASCII characters. You can learn the basics of Python and Unicode by searching for "python + unicode". In the meantime, convert your unicode strings to UTF-8 with:
print some_unicode_string.encode('utf-8')
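To make the difference concrete, here is a minimal sketch of the failing and the working conversion. The sample string `title` and the helper `encode_titles` are made up for illustration and are not part of the question's code; the string simply contains the u'\xe2' character named in the traceback.

```python
# -*- coding: utf-8 -*-
# Sketch only: `title` is a made-up sample, not real Hacker News data.
title = u"Caf\xe2 culture"  # contains u'\xe2', as in the traceback

# Encoding to ASCII fails because u'\xe2' is outside the 0-127 range.
try:
    title.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding to UTF-8 succeeds and yields a byte string.
utf8_bytes = title.encode("utf-8")


def encode_titles(titles, encoding="utf-8"):
    """Hypothetical helper: encode each unicode title to a byte string."""
    return [t.encode(encoding) for t in titles]
```

Applied to the question's code, the same idea would mean encoding each title's text before printing it, rather than printing the list of tags directly. Whether the result displays correctly still depends on the terminal expecting UTF-8.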