How to open a unicode text file inside a zip?
Problem description
I tried:
    with zipfile.ZipFile("5.csv.zip", "r") as zfile:
        for name in zfile.namelist():
            with zfile.open(name, 'rU') as readFile:
                line = readFile.readline()
                print(line)
                split = line.split('\t')
It prints:
    b'$0.0\t1822\t1\t1\t1\n'
    Traceback (most recent call last)
      File "zip.py", line 6
        split = line.split('\t')
    TypeError: Type str doesn't support the buffer API
How do I open the text file as unicode (str) instead of as bytes (b'...')?
Solution
Edit: For Python 3, using io.TextIOWrapper, as J. F. Sebastian describes, is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.
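As a minimal sketch of the io.TextIOWrapper approach on Python 3, the following wraps the binary stream returned by ZipFile.open so that iteration yields str lines. The in-memory archive, the member name "5.csv", and its contents are made up here only so the example runs on its own; the encoding is assumed to be UTF-8.

```python
import io
import zipfile

# Build a small in-memory zip so the sketch is self-contained
# (hypothetical member name and contents, assumed UTF-8).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zfile:
    zfile.writestr("5.csv", "$0.0\t1822\t1\t1\t1\n".encode("utf-8"))

buf.seek(0)
rows = []
with zipfile.ZipFile(buf, "r") as zfile:
    for name in zfile.namelist():
        # ZipFile.open returns a binary file object; TextIOWrapper
        # decodes it and handles newline translation.
        with io.TextIOWrapper(zfile.open(name), encoding="utf-8") as text_file:
            for line in text_file:
                rows.append(line.rstrip("\n").split("\t"))

print(rows)
```

Because each line is already a str, line.split('\t') works without raising the TypeError shown in the question.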
If the file is utf-8, this will work:
    # the rest of the code as above, then:
    # (Python 3.6+ rejects the 'rU' mode; use plain 'r' there)
    with zfile.open(name, 'rU') as readFile:
        line = readFile.readline().decode('utf8')
        # etc
If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().
    import codecs

    with zfile.open(name, 'rU') as readFile:
        for line in codecs.iterdecode(readFile, 'utf8'):
            print(line)
            # etc
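For reference, here is a self-contained sketch of the codecs.iterdecode loop that runs on Python 3. The archive is built in memory, and the member name "sample.txt" and its contents are invented for illustration; the mode argument is omitted because newer Python 3 releases accept only 'r' or 'w' for ZipFile.open.

```python
import codecs
import io
import zipfile

# Hypothetical archive with one UTF-8 member containing non-ASCII text.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zfile:
    zfile.writestr("sample.txt", "caf\u00e9\t1\nna\u00efve\t2\n".encode("utf-8"))

buf.seek(0)
decoded_lines = []
with zipfile.ZipFile(buf, "r") as zfile:
    with zfile.open("sample.txt") as read_file:
        # Iterating read_file yields bytes lines; iterdecode decodes
        # each chunk incrementally and yields str.
        for line in codecs.iterdecode(read_file, "utf8"):
            decoded_lines.append(line)

print(decoded_lines)
```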
Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
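The UTF-16 pitfall can be illustrated with a short sketch. It stores a little-endian UTF-16 member (the name "utf16.txt" and the text are made up), shows that byte-level line iteration leaves a stray null byte at the start of the next chunk, and then decodes the whole payload from ZipFile.read in one step instead.

```python
import io
import zipfile

text = "a\tb\nc\td\n"

# Hypothetical archive with one little-endian UTF-16 member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zfile:
    zfile.writestr("utf16.txt", text.encode("utf-16-le"))

buf.seek(0)
with zipfile.ZipFile(buf, "r") as zfile:
    # Byte-level iteration splits right after each b'\n', i.e. in the
    # middle of the two-byte newline b'\x0A\x00'.
    with zfile.open("utf16.txt") as read_file:
        byte_lines = list(read_file)

    # Reading the whole member and decoding once avoids the problem.
    whole_text = zfile.read("utf16.txt").decode("utf-16-le")

lines = whole_text.splitlines()
print(byte_lines)
print(lines)
```

Each chunk in byte_lines after the first begins with the leftover b'\x00', so decoding the chunks one by one would not give the original lines, while lines holds the text split correctly.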