Encoding error while parsing RSS with lxml
Question
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.
import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
Recommended answer
You should probably set the character encoding yourself only as a last resort, since the encoding is already clear from the XML prolog (if not from the HTTP headers). In any case, it's unnecessary to pass the encoding to etree.XMLParser unless you want to override it, so remove the encoding parameter and it should work.
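To illustrate the point about the XML prolog, here is a minimal sketch, assuming Python 3 and a recent lxml. The document bytes are a made-up stand-in, not the actual feed; they are encoded as ISO-8859-2 and say so in their prolog, and no encoding is passed to the parser.

```python
from lxml import etree

# Hypothetical sample document (not the real feed): the bytes are
# ISO-8859-2 encoded and the prolog declares that encoding.
doc_bytes = (
    '<?xml version="1.0" encoding="iso-8859-2"?>'
    '<title>Wiadomości</title>'
).encode("iso-8859-2")

# No encoding argument: the parser reads the declaration in the prolog.
parser = etree.XMLParser(ns_clean=True, recover=True)
root = etree.fromstring(doc_bytes, parser)

# The element text comes back as a properly decoded Unicode string.
title_text = root.text
print(title_text)
```

Passing encoding= to etree.XMLParser would override the declaration, which is only useful when the prolog is known to be wrong.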
Okay, the problem actually seems to be with lxml. The following works, for whatever reason:
parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
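A likely explanation is that etree.parse() expects a file name, URL, or file-like object, so passing it the downloaded content makes lxml treat that content as a path; the _parseDocumentFromURL frame in the traceback is consistent with this. If the feed has already been downloaded, etree.fromstring() or a file-like wrapper can be used instead. The sketch below assumes Python 3, and feed_bytes is a hypothetical stand-in for whatever response.read() returned.

```python
import io

from lxml import etree

# Hypothetical stand-in for the bytes returned by response.read().
feed_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<rss version="2.0"><channel><title>Wiadomości</title></channel></rss>'
).encode("utf-8")

parser = etree.XMLParser(ns_clean=True, recover=True)

# Option 1: parse the in-memory bytes directly; returns the root element.
root = etree.fromstring(feed_bytes, parser)

# Option 2: wrap the bytes in a file-like object so etree.parse()
# reads them as content rather than as a file name.
tree = etree.parse(io.BytesIO(feed_bytes), parser)

channel_title = root.findtext("channel/title")
print(channel_title)
```

Both options hand lxml raw bytes, so the encoding declared in the prolog is still what decides how the text is decoded.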