How to open a Unicode text file inside a zip?


Question

I tried:

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
            line = readFile.readline()
            print(line)
            split = line.split('\t')

It prints:

b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last)
File "zip.py", line 6
    split = line.split('\t')
TypeError: Type str doesn't support the buffer API

How can I open the text file as Unicode text (str) rather than as bytes (the b'...' prefix)?

Answer

Edit: For Python 3, using io.TextIOWrapper, as J. F. Sebastian describes, is the best choice. The rest of this answer may still be helpful for 2.x. I don't think anything in it is actually incorrect even for 3.x, but io.TextIOWrapper is still better.
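As a rough illustration of the io.TextIOWrapper approach on Python 3, here is a minimal sketch that assumes the archive members are UTF-8 encoded. The helper name read_lines_from_zip and the in-memory archive are made up for the example and are not part of the original question or answer. It opens members with mode "r" rather than 'rU', since ZipFile.open dropped support for the 'U' mode in Python 3.6.

```python
import io
import zipfile


def read_lines_from_zip(zip_source, encoding="utf-8"):
    """Yield (member_name, decoded_line) pairs for every member of the archive.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    with zipfile.ZipFile(zip_source, "r") as zfile:
        for name in zfile.namelist():
            # ZipFile.open returns a binary file object; TextIOWrapper
            # decodes it and handles newlines, so each line is a str.
            with io.TextIOWrapper(zfile.open(name, "r"), encoding=encoding) as text:
                for line in text:
                    yield name, line


# Build a tiny archive in memory so the sketch runs without a file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("5.csv", "$0.0\t1822\t1\t1\t1\n".encode("utf-8"))

for name, line in read_lines_from_zip(buffer):
    fields = line.rstrip("\n").split("\t")
    print(name, fields)  # 5.csv ['$0.0', '1822', '1', '1', '1']
```

Because the wrapper decodes before splitting into lines, line.split('\t') receives a str and the TypeError from the question does not occur.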

If the file is UTF-8, this will work:

# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc

If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().

import codecs

with zfile.open(name, 'rU') as readFile:
    for line in codecs.iterdecode(readFile, 'utf8'):
        print(line)
        # etc

Note that neither approach is necessarily safe for every multibyte encoding. For example, little-endian UTF-16 encodes the newline character as the bytes b'\x0A\x00'. A tool that is not Unicode-aware and looks for newlines will split on the b'\x0A' byte, leaving the null byte at the start of the following line. In such a case you have to use something that doesn't try to split the input on newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for UTF-8.
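The read-then-decode approach can be sketched as follows. This is only an illustration under stated assumptions: the helper name read_text_member, the member name utf16.txt, and the choice of utf-16-le are invented for the example, and the archive is built in memory so the code runs on its own.

```python
import io
import zipfile


def read_text_member(zip_source, name, encoding):
    """Read one archive member in full, then decode it in a single step.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(zip_source, "r") as zfile:
        data = zfile.read(name)  # bytes for the whole member, no line splitting
    return data.decode(encoding)


# Build a small UTF-16 (little-endian) member in memory for the demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("utf16.txt", "a\tb\nc\td\n".encode("utf-16-le"))

text = read_text_member(buffer, "utf16.txt", "utf-16-le")

# Split into lines only after decoding, when newlines are real characters.
rows = [line.split("\t") for line in text.splitlines()]
print(rows)  # [['a', 'b'], ['c', 'd']]
```

The trade-off is that the whole member is held in memory at once, which may not suit very large files.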
