Python zipfile module - zipfile.write() file with Turkish chars in filename

Question

On my system there are many Word documents and I want to zip them using the Python module zipfile.

I found a solution to this problem, but some files on my system have German umlauts and Turkish characters in their filenames.

I have adapted the method from the solution like this, so it can process German umlauts in the filenames:

def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise

Unfortunately, my attempts to handle Turkish characters as well have failed: every time a filename contains a Turkish character, the code raises an exception like this one:

exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'

I have tried several encode/decode combinations, but none of them worked.

Can someone help me out here?


I edited the above code to include the changes mentioned in the comment.

The following errors are now shown:

...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
  File "Backup.py", line 48, in <module>
    zipdir('X:\\my\\path', zipf)
  File "Backup.py", line 12, in zipdir
    ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
 not in range(128)

The ³ is actually a German ü.


EDIT

After trying the suggestions from the comments, I still could not work out a solution.

Therefore I switched to the Groovy programming language and used its ZIP capabilities.

As this is an opinion-based discussion, I have decided to vote to close the thread.

Solution

If you do not need to inspect the ZIP file with an archiver later, you can always encode the filenames to base64 and restore them when extracting with Python.

To any archiver these filenames will look like gibberish, but the original names will be preserved.
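The base64 idea above could look roughly like the following. This is only a sketch in Python 3 (the rest of this answer is Python 2); the helper names to_archive_name and from_archive_name are made up for illustration, and the URL-safe base64 alphabet is my own choice, picked because standard base64 can emit "/", which ZIP treats as a path separator.

```python
import base64


def to_archive_name(filename):
    """Turn a Unicode filename into an ASCII-only archive member name."""
    raw = filename.encode("utf-8")
    # The URL-safe alphabet uses "-" and "_" instead of "+" and "/".
    return base64.urlsafe_b64encode(raw).decode("ascii")


def from_archive_name(member):
    """Reverse to_archive_name() when extracting."""
    return base64.urlsafe_b64decode(member.encode("ascii")).decode("utf-8")


original = "Türkçe ığşİ.doc"
encoded = to_archive_name(original)
restored = from_archive_name(encoded)
```

The encoded name is pure ASCII, so no archiver has to guess a codepage, at the price of unreadable names inside the archive.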

Anyway, to get a byte string (str in Python 2, a bytes object in Py3), you have to encode(), not decode().

encode() serializes the unicode string to a sequence of bytes.

>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'

decode() converts those bytes back to a unicode string:

>>> "\xc5\xa1blah".decode("utf-8")
u'\u0161blah'

Same goes for any other codepage.
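To make the "any other codepage" remark concrete, here is a small Python 3 sketch (in Python 3, str is already Unicode and encode() returns bytes). The sample word and the choice of cp1254 (Windows Turkish) and cp1252 (Windows Western European) are mine, not from the question.

```python
text = "ışık"  # Turkish word with dotless i (U+0131) and s-cedilla (U+015F)

# UTF-8 can represent every character; each of these letters takes two bytes.
utf8_bytes = text.encode("utf-8")

# The Turkish Windows codepage has single-byte slots for the same letters.
cp1254_bytes = text.encode("cp1254")

# The Western European codepage has no slot for them, so encoding fails.
try:
    text.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError:
    fits_in_cp1252 = False

# Decoding with the codec that produced the bytes restores the text.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_cp1254 = cp1254_bytes.decode("cp1254")
```

This is also why a Western codepage can handle German umlauts but not Turkish letters: the bytes only mean something together with the right codec.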

Sorry for emphasizing that, but people sometimes get confused about encoding and decoding stuff.

If you need the files but aren't much concerned about preserving umlauts and other symbols, you can use:

u"üsdlakui".encode("ascii", "replace")

or:

u"üsdlakui".encode("ascii", "ignore")

This will replace characters the target codec cannot represent with a placeholder ("?"), or drop them entirely. The error handler only matters for codecs with a limited repertoire, such as ASCII or a Windows codepage; UTF-8 can encode every character, so with "utf-8" it changes nothing.

That will fix things if the raised error is something like UnicodeEncodeError: ... can't encode character ...

But the problem will be with filenames consisting only of non-Latin characters.
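A quick Python 3 sketch of what the two error handlers do, and of the caveat about names made up entirely of non-Latin characters. The ASCII target codec is an assumption chosen for illustration; the sample names are invented.

```python
name = "üsdlakui"

# "replace" substitutes "?" for every character the codec cannot represent.
replaced = name.encode("ascii", "replace")

# "ignore" silently drops those characters.
ignored = name.encode("ascii", "ignore")

# A name consisting only of unrepresentable characters loses everything,
# so different files can collapse into the same (empty) archive name.
only_non_ascii = "şğıİ".encode("ascii", "ignore")
```

Both handlers are lossy, so they only make sense when the exact original name does not matter.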

Now something that might actually work:

Well,

'Sömethüng'.encode("utf-8")

is bound to raise a UnicodeDecodeError ("'ascii' codec can't decode byte ...") in Python 2. The literal is a byte string, not a unicode string, so encode() first has to decode it implicitly with the default ASCII codec, and the non-ASCII bytes make that step fail. This is the same error shown in the question's traceback.

while:

# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")

or

# -*- coding: UTF-8 -*-
unicode('Sömethüng', "utf-8").encode("utf-8")

with the encoding declared at the top of the file, and the file itself saved as UTF-8, should work.
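Python 2 performs that ASCII decode implicitly. In Python 3 the same step has to be written out, which makes the failure easy to reproduce. The sketch below assumes the byte string came from a Windows system using cp1252, where "ü" is the single byte 0xfc seen in the question's traceback; the filename is a shortened, made-up variant of the one in the question.

```python
raw = b"bla B\xfcbla.doc"  # "ü" stored as the single cp1252 byte 0xfc

# What Python 2 does behind the scenes when encode() is called on a byte string:
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding with the codepage that actually produced the bytes works,
# and the resulting text can then be encoded to UTF-8.
decoded = raw.decode("cp1252")
utf8_name = decoded.encode("utf-8")
```

The point is the order of operations: decode with the real source codepage first, then encode to UTF-8.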

Yes, your strings come from the OS (filenames) rather than from literals, but that has been the problem from the start.

Even if the encoding goes through correctly, the ZIP side still has to be solved.

By specification, ZIP should store filenames using CP437, but this is rarely the case.

Most archivers use the default OS encoding (MBCS in Python).

And most archivers don't support UTF-8. So what I propose here should work, but not with all archivers.

To tell an archiver that the archive uses UTF-8 filenames, bit 11 of flag_bits (mask 0x800) should be set. As I said, some archivers do not check that bit. This is a fairly recent addition to the ZIP spec. (Well, a few years old, really.)
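The flag can be observed directly. The following Python 3 sketch writes two entries to an in-memory archive and reads the flag bits back; it relies on Python 3's zipfile setting the bit by itself for names that are not pure ASCII. UTF8_FLAG is just a local name for the 0x800 mask, and the member names are invented.

```python
import io
import zipfile

UTF8_FLAG = 1 << 11  # bit 11 of the general purpose flags, i.e. 0x800

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Türkçe_ığşİ.doc", b"dummy")
    archive.writestr("plain_ascii.doc", b"dummy")

# Re-open the archive and check which entries announce UTF-8 names.
with zipfile.ZipFile(buffer) as archive:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in archive.infolist()
    }
```

Only the entry with non-ASCII characters carries the flag; a pure ASCII name reads the same under CP437 and UTF-8, so it does not need it.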

I won't write the whole program here, just the Python 2 part needed to understand the idea.

# -*- coding: utf-8 -*-
# Python 2. Declaring the source encoding as UTF-8 never hurts. :D

import os, time, zipfile
zf = zipfile.ZipFile(...)
# Careful here: origname is the full path of the file on disk,
# filename is the name under which the file will be stored in the ZIP.
# A relative name is probably better than a full path, so that extracting
# does not cause problems. You decide.
filename = origname = os.path.join(root, filename)
# Filenames from the OS may already be unicode, or byte strings in a local codepage.
# MBCS is used here to decode byte strings, so that we can encode to UTF-8 later.
# Getting the codepage from the OS manually (from kernel32.dll on Windows)
# would be better than relying on MBCS, but for now:
if isinstance(filename, str):
    filename = filename.decode("mbcs")
# Otherwise, assume it is already a decoded unicode string.
# Prepare the filename for the archive: drop the drive and leading separators.
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8")  # the bytes that will be stored in the archive
# Note: the modification time comes from os.path.getmtime().
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])
# Here you should also set zinfo.external_attr to store Unix permission bits,
# and zinfo.compress_type. Both are optional and unrelated to your problem.
zinfo.flag_bits |= 0x800  # set bit 11 to announce UTF-8 filenames
with open(origname, "rb") as f:
    zf.writestr(zinfo, f.read())

I didn't test it, I just wrote the code down, but this is the idea, even if a bug crept in somewhere.

If this doesn't work, I don't know what will.
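For comparison, here is how the question's zipdir could look on Python 3, where filenames from os.walk() are already Unicode strings and zipfile handles the UTF-8 flag itself. This is a sketch under those assumptions, not a tested replacement for the Python 2 code above; the temporary directory and the sample filenames exist only to make the example self-contained, and it assumes the filesystem accepts Unicode names.

```python
import os
import tempfile
import zipfile


def zipdir(path, archive):
    """Add every file under path to archive, using names relative to path."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            # Store a relative name with forward slashes, not the full path.
            arcname = os.path.relpath(full_path, path).replace(os.sep, "/")
            archive.write(full_path, arcname)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "docs")
    os.makedirs(os.path.join(source, "alt"))
    for relative in ("Türkçe ığşİ.doc", os.path.join("alt", "Übung.doc")):
        with open(os.path.join(source, relative), "wb") as handle:
            handle.write(b"dummy")

    target = os.path.join(workdir, "backup.zip")
    with zipfile.ZipFile(target, "w") as archive:
        zipdir(source, archive)

    # Read the names back to confirm they survived the round trip.
    with zipfile.ZipFile(target) as archive:
        stored = set(archive.namelist())
```

No manual encode() or decode() calls are needed here, because the names never leave the Unicode domain until zipfile writes them.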
