Ignore encoding errors in Python (iterparse)?
I've been fighting with this for an hour now. I'm parsing an XML string with iterparse. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding.
Here's the error I get:
lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding !
Bytes: 0xEA 0x76 0x65 0x73
How can I simply ignore this error and continue parsing? I don't mind if a character is not saved properly; I just need the data.
Here's what I've tried, all picked from the internet:
data = data.encode('UTF-8','ignore')
data = unicode(data,errors='ignore')
data = unicode(data.strip(codecs.BOM_UTF8), 'utf-8', errors='ignore')
Edit:
I can't show the url, as it's a private API and involves my API key, but this is how I obtain the data:
ur = urlopen(url)
data = ur.read()
The character that causes the problem is: å. I guess that ä, ö, etc. would also break it.
Here's the part where I try to parse it:
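from StringIO import StringIO  # Python 2, as in the rest of this thread
from lxml import etree

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        # free the element and any already-processed preceding siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    print elem.xpath('title/text()')

context = etree.iterparse(StringIO(data), tag='item')
fast_iter(context, process_element)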
Edit 2:
This is what happens when I try to parse it in PHP. Just to clarify, F***ing Åmål is a drama movie =D
The file starts with <?xml version="1.0" encoding="UTF-8" ?>
Here's what I get from print repr(data[offset-10:offset+60]):
ence des r\xeaves, La</title>\n\t\t<year>2006</year>\n\t\t<imdb>0354899</imdb>\n
You say: "The character that causes the problem is: å".
How do you know that? What are you viewing your text with?
So you can't publish the URL and your API key; what about reading the data, writing it to a file (in binary mode), and publishing that?
When you open that file in your web browser, what encoding does it detect?
At the very least, do this:
data.decode('utf8') # where data is what you get from ur.read()
This will produce an exception that will tell you the byte offset of the non-UTF-8 stuff.
Then do this:
print repr(data[offset-10:offset+60])
and show us the results.
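The two steps above can be combined into one helper. This is only a minimal sketch in Python 3 (the thread itself uses Python 2): the function name find_bad_utf8 and the sample bytes are made up for illustration, and the only library behaviour relied on is that UnicodeDecodeError exposes the failing position as its start attribute.

```python
def find_bad_utf8(data: bytes, before: int = 10, after: int = 60):
    """Return (offset, surrounding bytes) for the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        offset = exc.start  # index of the first byte that could not be decoded
        return offset, data[max(0, offset - before):offset + after]
    return None  # the whole input is valid UTF-8


# Invented sample standing in for the bytes returned by ur.read()
sample = b"<title>science des r\xeaves, La</title>"

result = find_bad_utf8(sample)
if result is not None:
    offset, context = result
    print(offset, repr(context))
```

Slicing with max(0, ...) keeps the window valid when the bad byte sits within the first few bytes of the input.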
Assuming the encoding is actually cp1252 and decoding the bytes in the lxml error message:
>>> guff = "\xEA\x76\x65\x73"
>>> from unicodedata import name
>>> [name(c) for c in guff.decode('1252')]
['LATIN SMALL LETTER E WITH CIRCUMFLEX', 'LATIN SMALL LETTER V', 'LATIN SMALL LETTER E', 'LATIN SMALL LETTER S']
>>>
So are you seeing e-circumflex followed by ves, or a-ring followed by ves, or a-ring followed by something else?
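For readers on Python 3, the same check can be run against several candidate encodings at once. This is a hedged sketch: the helper name describe and the list of candidates are assumptions chosen for illustration, and only the four bytes come from the error message.

```python
from unicodedata import name

guff = b"\xEA\x76\x65\x73"  # the bytes reported in the lxml error message


def describe(raw: bytes, encoding: str):
    """Decode raw with the given encoding and return the Unicode name of each character."""
    return [name(ch) for ch in raw.decode(encoding)]


# Candidate single-byte encodings (an illustrative list, not an exhaustive one)
for candidate in ("cp1252", "iso-8859-1", "iso-8859-2"):
    print(candidate, describe(guff, candidate)[0])
```

Comparing the name of the first character under each encoding shows which interpretation gives a plausible word for the language of the text.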
Does the data start with an XML declaration like <?xml version="1.0" encoding="UTF-8"?>? If not, what does it start with?
Clues for encoding guessing/confirmation: What language is the text written in? What country?
UPDATE (based on further information supplied):
Based on the snippet that you showed in the vicinity of the error, the movie title is "La science des rêves" (the science of dreams).
Funny how PHP gags on "F***ing Åmål" but Python chokes on French dreams. Are you sure that you did the same query?
You should have told us it was IMDB up front; you would have got your answer much sooner.
SOLUTION: before you pass data to the lxml parser, do this:
data = data.replace('encoding="UTF-8"', 'encoding="iso-8859-1"')
That's based on the encoding that they declare on their website, but that may be a lie too. In that case, try cp1252 instead. It's definitely not iso-8859-2.
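Put together, the workaround might look like the following in Python 3, where the downloaded data is a bytes object. This is a sketch under stated assumptions, not the exact code from the thread: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree so that it runs without extra packages, the helper names redeclare_encoding and titles are hypothetical, and the sample document is invented.

```python
from io import BytesIO
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def redeclare_encoding(data: bytes, actual: bytes = b"iso-8859-1") -> bytes:
    """Rewrite the first encoding="UTF-8" declaration to name the real encoding."""
    return data.replace(b'encoding="UTF-8"', b'encoding="' + actual + b'"', 1)


def titles(data: bytes):
    """Yield the <title> text of every <item>, clearing each item once it is processed."""
    for _event, elem in ET.iterparse(BytesIO(data), events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title")
            elem.clear()


# Invented sample: declares UTF-8 but contains the single byte 0xEA
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<items><item><title>science des r\xeaves, La</title>"
    b"<year>2006</year></item></items>"
)

print(list(titles(redeclare_encoding(sample))))
```

Limiting the replacement to the first occurrence keeps the rewrite confined to the XML declaration even if the same string appears later in the payload.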