Despite UTF-8 encoding, some characters fail to be recognized


Question

I'm trying to scrape an RSS feed with a news title like this:

<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>

This is effectively how I'm using Beautiful Soup to scrape it:

soup = BeautifulSoup(xml, 'xml')
start = soup.findAll('item')
for i in start:
    news, is_created = News.create_or_update(
        news_id,
        head_line=i.title.text.encode('utf-8').strip(),
        ...)

However, despite this effort, the title still looks like this:

Photo of iceberg that is believed to have sunk Titanic sold at auction for \xa321,000 alongside &#039;world&#039;s most valuable biscuit&#039;

Would it be easier just to convert these special characters into ASCII characters?

Recommended answer

For the example you provide, this works for me:

from bs4 import BeautifulSoup
import html

xml='<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>'
soup = BeautifulSoup(xml, 'lxml')
print(html.unescape(soup.get_text()))

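html.unescape handles the HTML entities. The feed double-escapes its apostrophes (&amp;#039;), so the parser decodes only the outer &amp; and leaves &#039; in the text; html.unescape resolves that remaining layer. If Beautiful Soup is not handling the pound sign correctly, you may need to specify the encoding when creating the BeautifulSoup object (from_encoding takes effect only when the markup is passed as bytes, not as an already-decoded string), e.g.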

soup = BeautifulSoup(xml, "lxml", from_encoding='latin-1')
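
The two symptoms in the question (leftover &#039; entities and the \xa3 pound sign) can be looked at separately. The following is a minimal sketch using only the standard library; the variable names raw, once, and twice are illustrative and not part of the original code, and the first html.unescape call merely stands in for the single level of decoding an XML parser performs.

```python
import html

# Title text as it appears in the feed: the apostrophes are double-escaped.
raw = ("sold at auction for \u00a321,000 alongside "
       "&amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;")

# First pass: &amp;#039; becomes &#039; (one level of escaping removed).
once = html.unescape(raw)

# Second pass: &#039; becomes a plain apostrophe.
twice = html.unescape(once)

print(once)
print(twice)

# The pound sign is the code point U+00A3; its bytes depend on the encoding.
pound_utf8 = "\u00a3".encode("utf-8")      # two bytes
pound_latin1 = "\u00a3".encode("latin-1")  # one byte
print(pound_utf8, pound_latin1)
```

If the title is kept as text (str) rather than encoded to bytes before storage, the pound sign prints as £; calling .encode('utf-8') only changes its byte representation and does nothing to the HTML entities.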
