Encoding error while parsing RSS with lxml

Question

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.

import urllib2  # Python 2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()  # raw body of the feed
encd = chardet.detect(response)['encoding']  # guessed encoding
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)  # this call raises the error

But I get an error:

tree = etree.parse(response, parser)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

Recommended answer

You should probably set the character encoding explicitly only as a last resort, since the encoding is already clear from the XML prolog (if not from the HTTP headers). In any case, there is no need to pass an encoding to etree.XMLParser unless you want to override the one the document declares, so drop the encoding parameter and it should work.

Okay, the problem actually seems to be with lxml. The following works, for whatever reason:

from lxml import etree

parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
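A possible explanation, offered as a guess rather than a confirmed diagnosis: etree.parse() expects a file name, URL, or file-like object, and the traceback in the question passes through _parseDocumentFromURL, which suggests the downloaded string was treated as a path rather than as XML content. If the feed body is already in memory, a minimal sketch is to hand the raw bytes to etree.fromstring(), or to wrap them in io.BytesIO for etree.parse(), and let the parser read the encoding from the prolog. The sample feed, its iso-8859-2 encoding, and the helper names parse_feed_bytes and parse_feed_stream are assumptions made up for illustration. The fallback import is only there so the sketch runs without lxml installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: the standard library offers the same fromstring()/parse() calls.
    import xml.etree.ElementTree as etree

# Hypothetical sample standing in for the downloaded response body (raw bytes).
# The encoding is an assumption chosen to show that the prolog is honoured.
SAMPLE_RSS = (
    u'<?xml version="1.0" encoding="iso-8859-2"?>\n'
    u'<rss version="2.0"><channel>'
    u'<title>Wiadomo\u015bci</title>'
    u'<item><title>Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144</title></item>'
    u'</channel></rss>'
).encode('iso-8859-2')


def parse_feed_bytes(data):
    """Parse XML held in memory as bytes; the encoding comes from the prolog."""
    return etree.fromstring(data)


def parse_feed_stream(data):
    """Same document, parsed through a file-like object with etree.parse()."""
    return etree.parse(io.BytesIO(data)).getroot()


root = parse_feed_bytes(SAMPLE_RSS)
titles = [element.text for element in root.iter('title')]
```

In the question's code this would amount to replacing etree.parse(response, parser) with etree.fromstring(response, parser), again without the encoding argument.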
