Again: UnicodeEncodeError: ascii codec can't encode

Question

I have a folder of XML files that I would like to parse. I need to get text out of the elements of these files. They will be collected and printed to a CSV file where the elements are listed in columns.

I can actually do this right now for some of my files. That is, for many of my XML files, the process goes fine, and I get the output I want. The code that does this is:

import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs
def parseEO(doc):
    #getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles =[]
    elements = [agencycodes, rins, titles]
    #pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))


parseEO('EO_file.xml')     

However, on some versions of the input file, I get the infamous error:

'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

The full traceback is:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE

/Users/ian/Desktop/parsingtest.py in <module>()
     91         writer.writerows(zip(*elements))
     92 
---> 93 parseEO('/EO_file.xml')
     94 
     95 

/parsingtest.py in parseEO(doc)
     89     with open('parsetest.csv', 'w') as f:
     90         writer = csv.writer(f)
---> 91         writer.writerows(zip(*elements))
     92 
     93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

I am fairly confident from reading the other threads that the problem is in the codec being used (and, you know, the error is pretty clear on that as well). However, the solutions I have read haven't helped me (emphasized because I understand I am the source of the problem, not the way people have answered in the past).

Several responses (such as: this one and this one and this one) don't deal directly with ElementTree, and I'm not sure how to translate the solutions into what I'm doing.

Other solutions that do deal with ElementTree (such as: this one and this one) are either using a short string (the first link here) or are using the .tostring/.fromstring methods in ElementTree which I do not. (Though, of course, perhaps I should be.)

What I have tried that did not work:

  1. I have attempted to bring in the file via UTF-8 encoding:

infile = codecs.open('/EO_file.xml', encoding="utf-8")
parseEO(infile)

but I think the ElementTree process already understands it to be UTF-8 (which is noted in the first line of all the XML files I have), and so this is not only not correct, but is actually redundantly bad all over again.

  2. I attempted to declare the encoding on the parser, replacing:

tree = ET.ElementTree(file=doc)

with

parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(doc, parser=parser)

in the code above that does work. This didn't work for me either. The same files that worked before still worked, and the same files that raised the error still raised it.

There have been a lot of other random attempts, but I won't belabor the point.

So, while I assume the code I have is both inefficient and offensive to good programming style, it does do what I want for several files. I am trying to understand whether there is simply an argument I'm missing, whether I should somehow pre-process the files (I have not identified where the offending character is, but do know that u'\x97' translates to a control character of some kind), or whether there is some other option.

Recommended answer

You are parsing XML; the XML API hands you unicode values. You are then attempting to write the unicode data to a CSV file without encoding it first. Python then attempts to encode it for you, using the ASCII codec, and fails. You can see this in your traceback: it is the .writerows() call that fails, and the error tells you that encoding is failing, not decoding (parsing the XML).
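
The failure can be reproduced without any XML or CSV involved. The sketch below assumes Python 3 (the question uses Python 2, where the implicit ASCII encode happens inside the csv writer), so the encode is made explicit here. The helper names `ascii_encodable` and `find_non_ascii` and the sample title are hypothetical and serve only as illustration.

```python
def ascii_encodable(text):
    """Report whether text survives an explicit ASCII encode."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # Same exception class as in the traceback: encoding failed.
        return False
    return True


def find_non_ascii(text):
    """Return (position, character) pairs for characters outside ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# Hypothetical title containing the character named in the error message.
title = "Program\x97Final Rule"

print(ascii_encodable("Plain title"))  # no non-ASCII characters
print(ascii_encodable(title))          # contains U+0097
print(find_non_ascii(title))           # where the offending character sits
```

Running a check like `find_non_ascii` over each `elem.text` is one way to locate the characters that trigger the error before deciding how to encode them.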

You need to choose an encoding, then encode your data before writing:

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(elem.text.encode('utf8'))
    elif elem.tag == "TITLE":
        titles.append(elem.text.encode('utf8'))

I used the UTF-8 encoding because it can handle any Unicode code point, but you need to make your own, explicit choice.
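
On Python 3 (an assumption; the question and the code above are written for Python 2), the csv module works with text rather than bytes, so the encoding is chosen once when the output file is opened instead of on each value. A minimal sketch, in which the function name `write_rows`, the sample XML, and the temporary output path are made up for illustration:

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical input; the second character of the first title is non-ASCII.
SAMPLE_XML = (
    "<ROOT>"
    "<ENTRY><RIN>1234-AB56</RIN><TITLE>Program\u2014Final Rule</TITLE></ENTRY>"
    "<ENTRY><RIN>6543-BA21</RIN><TITLE>Plain title</TITLE></ENTRY>"
    "</ROOT>"
)


def write_rows(xml_text, csv_path):
    """Collect RIN and TITLE text and write them as CSV columns in UTF-8."""
    root = ET.fromstring(xml_text)
    rins, titles = [], []
    for elem in root.iter():
        if elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    # The encoding is declared on the file, not on each value; newline=""
    # is what the csv documentation recommends for files given to csv.writer.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(zip(rins, titles))


out_path = os.path.join(tempfile.mkdtemp(), "parsetest.csv")
write_rows(SAMPLE_XML, out_path)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8", newline="") as f:
    print(list(csv.reader(f)))
```

The design choice is the same as in the Python 2 version: an explicit encoding is picked by the author of the code rather than left to a default, only the place where it is declared differs.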
