Ignore encoding errors in Python (iterparse)?


Problem description


I've been fighting with this for an hour now. I'm parsing an XML string with iterparse. However, the data is not encoded properly, and I am not its provider, so I can't fix the encoding.

Here's the error I get:

lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding !
Bytes: 0xEA 0x76 0x65 0x73

How can I simply ignore this error and continue parsing? I don't mind if a character is not saved properly; I just need the data.

Here's what I've tried, all picked up from the internet:

data = data.encode('UTF-8','ignore')
data = unicode(data,errors='ignore')
data = unicode(data.strip(codecs.BOM_UTF8), 'utf-8', errors='ignore')

Edit:
I can't show the url, as it's a private API and involves my API key, but this is how I obtain the data:

ur = urlopen(url)
data = ur.read()

The character that causes the problem is å; I guess that ä, ö, etc. would also break it.

Here's the part where I try to parse it:

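from StringIO import StringIO  # Python 2, as in the rest of this question
from lxml import etree

def fast_iter(context, func):
    # Process each matched element, then free it and its already-seen siblings
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    print elem.xpath('title/text()')

# data is the byte string returned by ur.read()
context = etree.iterparse(StringIO(data), tag='item')
fast_iter(context, process_element)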

Edit 2:
This is what happens when I try to parse it in PHP. Just to clarify, F***ing Åmål is a drama movie =D

The file starts with <?xml version="1.0" encoding="UTF-8" ?>

Here's what I get from print repr(data[offset-10:offset+60]):

ence des r\xeaves, La</title>\n\t\t<year>2006</year>\n\t\t<imdb>0354899</imdb>\n

Solution

You say:

The character that causes the problem is: å,

How do you know that? What are you viewing your text with?

So you can't publish the URL and your API key; what about reading the data, writing it to a file (in binary mode), and publishing that?

When you open that file in your web browser, what encoding does it detect?

At the very least, do this

data.decode('utf8') # where data is what you get from ur.read()

This will produce an exception that will tell you the byte offset of the non-UTF-8 stuff.

Then do this:

print repr(data[offset-10:offset+60])

and show us the results.
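
The two diagnostic steps above can be combined into one small helper. This is only a sketch, written for Python 3 (where `ur.read()` returns `bytes`) rather than the Python 2 used in the question; the `data` value is a made-up sample modelled on the snippet the asker later posted, and `first_bad_offset` is a hypothetical name, not part of any library.

```python
# Hypothetical sample standing in for the bytes returned by ur.read().
data = b'<title>Science des r\xeaves, La</title>\n\t\t<year>2006</year>\n'

def first_bad_offset(raw):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte
        return exc.start
    return None

offset = first_bad_offset(data)
if offset is not None:
    # Show some context around the offending byte, as suggested above
    start = max(offset - 10, 0)
    print(repr(data[start:offset + 60]))
```

Clamping the start of the slice at zero avoids an accidental negative index when the bad byte sits within the first ten bytes of the data.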

Assuming the encoding is actually cp1252 and decoding the bytes in the lxml error message:

>>> guff = "\xEA\x76\x65\x73"
>>> from unicodedata import name
>>> [name(c) for c in guff.decode('1252')]
['LATIN SMALL LETTER E WITH CIRCUMFLEX', 'LATIN SMALL LETTER V', 'LATIN SMALL LETTER E', 'LATIN SMALL LETTER S']
>>>

So are you seeing e-circumflex followed by ves, or a-ring followed by ves, or a-ring followed by something else?

Does the data start with an XML declaration like <?xml version="1.0" encoding="UTF-8"?>? If not, what does it start with?

Clues for encoding guessing/confirmation: What language is the text written in? What country?
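
One way to weigh such clues is to decode the offending bytes under each candidate encoding and see which result reads as a real word. The following is a minimal Python 3 sketch of that idea; the list of candidates and the helper name `describe` are assumptions made for illustration.

```python
from unicodedata import name

guff = b'\xea\x76\x65\x73'  # the bytes quoted in the lxml error message

def describe(raw, encoding):
    """Decode raw with the given encoding and name each resulting character."""
    return [name(ch) for ch in raw.decode(encoding)]

# Candidate single-byte encodings to compare (an assumed shortlist)
for candidate in ('cp1252', 'iso-8859-1', 'iso-8859-2'):
    text = guff.decode(candidate)
    print(candidate, repr(text), describe(guff, candidate)[0])
```

cp1252 and iso-8859-1 both map 0xEA to e-circumflex, which fits a French word ending in "êves", whereas iso-8859-2 maps it to e-ogonek, which does not.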

UPDATE based on further information supplied.

Based on the snippet that you showed in the vicinity of the error, the movie title is "La science des rêves" (the science of dreams).

Funny how PHP gags on "F***ing Åmål" but Python chokes on French dreams. Are you sure that you did the same query?

You should have told us it was IMDB up front; you would have got your answer much sooner.

SOLUTION: Before you pass data to the lxml parser, do this:

data = data.replace('encoding="UTF-8"', 'encoding="iso-8859-1"')

That's based on the encoding that they declare on their website, but that may be a lie too. In that case, try cp1252 instead. It's definitely not iso-8859-2.
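
For readers on Python 3, the same fix can be sketched end to end as follows. Several things here are assumptions: the sample `data` is invented, the helper names `redeclare` and `titles` are hypothetical, and the standard library's `xml.etree.ElementTree.iterparse` is used as a stand-in for `lxml.etree.iterparse` so that the example has no third-party dependency (the stdlib version has no `tag=` argument, so the tag is checked by hand).

```python
from io import BytesIO
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Hypothetical sample: declared as UTF-8 but actually holding a Latin-1 byte.
data = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<items><item><title>Science des r\xeaves, La</title>'
        b'<year>2006</year></item></items>')

def redeclare(raw, encoding='iso-8859-1'):
    """Rewrite only the first declared encoding; the payload is left untouched."""
    new_decl = b'encoding="' + encoding.encode('ascii') + b'"'
    return raw.replace(b'encoding="UTF-8"', new_decl, 1)

def titles(raw):
    """Stream over the document and collect the title of each item element."""
    found = []
    for event, elem in ET.iterparse(BytesIO(raw), events=('end',)):
        if elem.tag == 'item':
            found.append(elem.findtext('title'))
            elem.clear()  # free the element once it has been processed
    return found

print(titles(redeclare(data)))
```

Limiting the replacement to the first match keeps the change confined to the XML declaration even if the same string happens to appear later in the payload. If iso-8859-1 turns out to be wrong, passing `'cp1252'` as the second argument of `redeclare` tries the other candidate mentioned above.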
