Python Unreproducible UnicodeDecodeError

Problem description

I'm trying to replace a substring in a Word file using the following command sequence in Python. On its own the code works perfectly fine, even with the exact same Word file, but when I embed it in a larger project structure, it throws an error at exactly that spot. I'm clueless as to what causes it, as it seemingly has nothing to do with the code, and I can't reproduce it in isolation.

Side note: I know what triggers the error. It's a German 'ü' in the Word file, but it's needed, and removing it doesn't seem like the right solution if the code works standalone.

#foo.py
from bar import make_wordm
def main(uuid):
    with open('foo.docm', 'wb') as f:  # binary mode: a .docm file is a zip archive
        f.write(make_wordm(uuid=uuid))

main('1cb02f34-b331-4616-8d20-aa1821ef0fbd')

foo.py imports bar.py, which does the heavy lifting.

#bar.py
import tempfile
import shutil
from cStringIO import StringIO
from zipfile import ZipFile, ZipInfo

WORDM_TEMPLATE='./res/template.docm'
MODE_DIRECTORY = 0x10

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    # Extract one archive member to a temporary directory, read it back,
    # and return its contents with `search` replaced by `replace`.
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with open(fname, 'rb') as fd:  # binary mode: members are raw bytes
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    # Load the template archive into memory (binary mode: it is a zip file).
    with open(template, 'rb') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')

    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            # Skip directory entries; only files are copied.
            if entry.external_attr & MODE_DIRECTORY:
                continue

            contents = zipinfo_contents_replace(
                zipfile=doc, zipinfo=entry,
                search="00000000-0000-0000-0000-000000000000", replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()

The following error is thrown when the same code is embedded in the larger project:

ERROR:root:message
Traceback (most recent call last):
  File "FooBar.py", line 402, in foo_bar
    bar = bar_constructor(bar_theme,bar_user,uuid)
  File "FooBar.py", line 187, in bar_constructor
    if(main(uuid)):
  File "FooBar.py", line 158, in main
    f.write(make_wordm(uuid=uuid))
  File "/home/foo/FooBarGen.py", line 57, in make_wordm
    search="00000000-0000-0000-0000-000000000000", replace=uuid)
  File "/home/foo/FooBarGen.py", line 24, in zipinfo_contents_replace
    contents = fd.read().replace(search, replace)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)
INFO:FooBar:None

edit: Upon further examination and debugging, it seems that the variable 'uuid' is causing the issue. When the parameter is given as a string literal ('1cb02f34-b331-4616-8d20-aa1821ef0fbd') instead of the variable parsed from JSON, it works perfectly fine.

edit2: I had to add uuid = uuid.encode('utf-8', 'ignore'), and it works perfectly fine now.

Recommended answer

The problem is that Unicode strings and byte strings are being mixed. Python 2 "helpfully" tries to convert from one to the other, but defaults to the ascii codec when doing so. In your traceback the conversion fails on byte 0xc3, which is the first byte of 'ü' when it is encoded as UTF-8.

Here is an example:

>>> 'aeioü'.replace('a','b')  # all byte strings
'beio\xfc'
>>> 'aeioü'.replace(u'a','b') # one Unicode string and it converts...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 4: ordinal not in range(128)
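
The implicit conversion in the second call can be made explicit. The sketch below is written for Python 3, where the decode step has to be spelled out, and it assumes the byte string is Latin-1 encoded (which is what the `\xfc` byte for 'ü' in the output above suggests). `decode_or_error` is a hypothetical helper used only for illustration. It shows that decoding with the ascii codec is the step that fails, while naming a matching codec works.

```python
# The byte string from the example above, with 'ü' stored as the single byte 0xfc.
raw = b'aeio\xfc'

def decode_or_error(data, encoding):
    """Decode data, returning either the text or a short error description."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'UnicodeDecodeError: ' + exc.reason

ascii_result = decode_or_error(raw, 'ascii')     # what Python 2 attempts implicitly
latin1_result = decode_or_error(raw, 'latin-1')  # explicit, matching codec

# With both operands as Unicode text, replace() needs no conversion at all.
replaced = latin1_result.replace(u'a', u'b')

print(ascii_result)
print(replaced == u'beio\xfc')
```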

You mentioned reading the UUID from JSON. The JSON parser returns Unicode strings. Ideally, decode all text files to Unicode when reading them, do all text processing in Unicode, and encode the text again when writing it back to storage. In your "larger framework" this could be a big porting job, but the essential step is to use io.open with an encoding, which reads the file and decodes it to Unicode:

import io

with io.open(fname, 'r', encoding='utf8') as fd:
    contents = fd.read().replace(search, replace)

Note that encoding should match the actual encoding of the files you are reading. That's something you'll have to determine.
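
Applied to the code in the question, the decode/process/encode pattern could look like the sketch below. It is written for Python 3 and is not the asker's code: `replace_in_docm_bytes` and `TEXT_SUFFIXES` are hypothetical names, it assumes the XML parts of the document are UTF-8 encoded, and it reads entries with `ZipFile.read` instead of extracting them to a temporary directory. Other parts (such as a VBA project) are copied unchanged because they are not text.

```python
import io
from zipfile import ZipFile

# Assumption: only these parts are text, and they are UTF-8 encoded.
TEXT_SUFFIXES = ('.xml', '.rels')

def replace_in_docm_bytes(template_bytes, search, replace, encoding='utf-8'):
    """Return a copy of a zipped document with search replaced in its text parts."""
    output_buf = io.BytesIO()
    with ZipFile(io.BytesIO(template_bytes), 'r') as doc, \
            ZipFile(output_buf, 'w') as out:
        for entry in doc.infolist():
            data = doc.read(entry)                      # always bytes
            if entry.filename.endswith(TEXT_SUFFIXES):
                text = data.decode(encoding)            # bytes -> Unicode
                text = text.replace(search, replace)    # all-Unicode processing
                data = text.encode(encoding)            # Unicode -> bytes
            out.writestr(entry, data)
    return output_buf.getvalue()
```

Because decoding and encoding happen at the edges of the function, it does not matter whether the caller's UUID came from a literal or from JSON, as long as it is passed in as text.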

A shortcut, as you found in your edit, is to encode the UUID from JSON back to a byte string, but handling text as Unicode should be the goal.
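
If that shortcut is used, it is tidier to do the conversion once, at the point where the value enters the byte-string code. A minimal sketch, written for Python 3 (where `json.loads` likewise returns Unicode text): `to_bytes` is a hypothetical helper, and the JSON payload and file contents are made up for illustration. It uses the default strict error handling rather than `'ignore'`, so a value that cannot be encoded raises instead of being silently altered.

```python
import json

def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it only if it is Unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

# Hypothetical payload; the question does not show the real JSON.
payload = json.loads('{"uuid": "1cb02f34-b331-4616-8d20-aa1821ef0fbd"}')
uuid_bytes = to_bytes(payload['uuid'])

# Bytes replaced within bytes: no implicit decoding is involved,
# so the UTF-8 encoded 'ü' (0xc3 0xbc) passes through untouched.
contents = b'id=00000000-0000-0000-0000-000000000000 f\xc3\xbcr'
contents = contents.replace(b'00000000-0000-0000-0000-000000000000', uuid_bytes)
```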

Python 3 cleans up this process by making strings Unicode by default and by dropping the implicit conversion between byte strings and Unicode strings.
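
For comparison, a short sketch of the Python 3 behaviour: mixing the two types fails immediately with a TypeError regardless of the data, instead of failing only when a non-ASCII byte happens to be present. The sample data is made up for illustration.

```python
data = b'aeio\xc3\xbc'   # UTF-8 bytes for u'aeioü'

try:
    data.replace('a', 'b')           # str arguments on a bytes object
    mixed_ok = True
except TypeError:
    mixed_ok = False                 # Python 3 refuses to convert implicitly

# Either stay in bytes...
as_bytes = data.replace(b'a', b'b')
# ...or decode explicitly and stay in text.
as_text = data.decode('utf-8').replace('a', 'b')
```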
