Convert raw byte string to Unicode without knowing the codepage beforehand

Problem description

When my application is launched from the right-click context menu, Windows passes the file path as a raw (byte) string.

For example:

path = 'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'

Many external packages in my application are expecting unicode type strings, so I have to convert it into unicode.

That would be easy if we knew the raw string's encoding beforehand (in the example, it is cp1255). However, I can't know which encoding each computer around the world uses locally.

How can I convert the string into unicode? Perhaps win32api is needed?

Solution

No idea why you might be getting the DOS code page (862) instead of ANSI (1255) - how is the right-click option set up?
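As a quick check of the code page remark above, the file-name bytes from the question can be decoded under both code pages. This is only an illustrative sketch: the variable names are made up here, and it relies on Python's built-in cp862 and cp1255 codecs.

```python
# The file-name bytes from the question's example path (directory and extension removed).
raw_name = b'\x99\x8c\x85\x8d'

# Interpreted as the DOS (OEM) Hebrew code page, the bytes form a Hebrew word.
as_oem = raw_name.decode('cp862')

# Interpreted as the ANSI Hebrew code page; any byte that cp1255 leaves
# undefined is replaced with U+FFFD instead of raising an error.
as_ansi = raw_name.decode('cp1255', 'replace')

print(repr(as_oem))
print(repr(as_ansi))
```

Only the cp862 reading yields Hebrew letters, which is why the bytes look like DOS code page output rather than ANSI.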

Either way - if you need to accept any arbitrary Unicode character in your arguments you can't do it from Python 2's sys.argv. This list is populated from the bytes returned by the non-Unicode version of the Win32 API (GetCommandLineA), and that encoding is never Unicode-safe.
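A rough illustration of why a byte-oriented command line is not Unicode-safe: once a name has been squeezed into a code page that lacks its characters, the original cannot be recovered. The sketch below assumes a machine whose ANSI code page is cp1252 and uses Python's 'replace' error handler as a stand-in for the lossy conversion Windows performs; the exact substitution Windows makes may differ.

```python
# A Hebrew file name, written with escapes so the source stays ASCII.
name = u'\u05e9\u05dc\u05d5\u05dd.mp3'

# Assumption: the machine's ANSI code page is cp1252 (Western European),
# which has no Hebrew letters, so each one degrades to '?'.
as_ansi = name.encode('cp1252', 'replace')

# Decoding the bytes again cannot bring the lost characters back.
round_trip = as_ansi.decode('cp1252')

print(repr(as_ansi))
print(round_trip == name)
```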

Many other languages including Java and Ruby are in the same boat; the limitation comes from the Microsoft C runtime's implementations of the C standard library functions. To fix it, one would call the Unicode version (GetCommandLineW) on Windows instead of relying on the cross-platform standard library. Python 3 does this.

In the meantime, for Python 2, you can do it by calling GetCommandLineW yourself, but it's not especially pretty. You can also use CommandLineToArgvW if you want Windows-style parameter splitting. You can do this with the win32 extensions or with just plain ctypes.

Example (though the step of encoding the Unicode string back to UTF-8 bytes is best skipped).
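For reference, a minimal ctypes sketch of the GetCommandLineW / CommandLineToArgvW approach might look like the following. It is not necessarily the example the answer refers to. The function name win32_unicode_argv is chosen here for illustration, and the non-Windows fallback is an addition so the snippet can run anywhere. The Unicode strings are returned as they are, without re-encoding them to UTF-8.

```python
import sys


def win32_unicode_argv():
    """Return the command-line arguments as Unicode strings.

    On Windows this asks the Win32 API directly; on other platforms it
    simply returns a copy of sys.argv (fallback added for this sketch).
    """
    if sys.platform != "win32":
        return list(sys.argv)

    import ctypes
    from ctypes import wintypes

    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = wintypes.LPCWSTR

    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                   ctypes.POINTER(ctypes.c_int)]
    CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)

    LocalFree = ctypes.windll.kernel32.LocalFree
    LocalFree.argtypes = [wintypes.HLOCAL]
    LocalFree.restype = wintypes.HLOCAL

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        args = [argv[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates one block that the caller must free.
        LocalFree(ctypes.cast(argv, ctypes.c_void_p))

    # The raw command line also contains the interpreter and its options;
    # keep only the tail that lines up with sys.argv.
    start = max(0, len(args) - len(sys.argv))
    return args[start:]


if __name__ == "__main__":
    for arg in win32_unicode_argv():
        print(repr(arg))
```

The result can be used in place of sys.argv, for example `path = win32_unicode_argv()[1]` when the script is started with a file path as its first argument.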
