BeautifulSoup findAll with class attribute: UnicodeEncodeError


Problem description

I am using BeautifulSoup to extract news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

When I run the code, however, it gives an error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How do I get this to work?

Recommended answer

BeautifulSoup works internally with unicode strings. Printing a unicode string to the console makes Python try to encode it with its default encoding, which is usually ASCII. This will generally fail for web pages that contain non-ASCII characters. You can learn the basics of Python and Unicode by searching for "python + unicode". In the meantime, convert your unicode strings to UTF-8 using

print some_unicode_string.encode('utf-8')
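
The following is a minimal sketch of why the ASCII conversion fails and what encoding to UTF-8 produces instead. It does not fetch Hacker News; the `title` string is a made-up stand-in that contains the same character (u'\xe2') reported in the traceback, and the variable names are assumptions for illustration only. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical title text containing the non-ASCII character from the traceback.
title = u"Caf\xe2 news"

# Encoding to ASCII fails because u'\xe2' is outside the 0-127 range.
ascii_failed = False
try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii encoding failed:", exc.reason)

# Encoding to UTF-8 succeeds and yields a byte string that is safe to write out.
encoded = title.encode("utf-8")
print(repr(encoded))
```

For the list returned by `get_stories`, the same idea would apply per element: take each link's text as a unicode string and encode it before printing, rather than printing the whole list at once.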
