Converting Non-UTF-8 characters to UTF-8

Views: 370 Published: 2016/8/25 9:53:51 Tags: python c linux

Question

I have some files on my Linux system. Their file names may be in an encoding other than un_eng-utf8. I want to convert the non-UTF-8 characters in those names to UTF-8. How can I do that using a C library function or a Python script?

Recommended answer

If you know the character encoding that is used to encode the filenames:

unicode_filename = bytestring_filename.decode(character_encoding)
utf8filename = unicode_filename.encode('utf-8')
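As a slightly fuller illustration of the two lines above, here is a minimal sketch for Python 3, assuming the source encoding really is known. The helper names `to_utf8_name` and `rename_to_utf8` are hypothetical and not part of the original answer. Passing a bytes path to `os.listdir` makes it return the raw filename bytes, which is what needs re-encoding.

```python
import os


def to_utf8_name(raw_name: bytes, source_encoding: str) -> bytes:
    """Decode raw filename bytes with the known encoding, then encode as UTF-8."""
    return raw_name.decode(source_encoding).encode("utf-8")


def rename_to_utf8(directory: bytes, source_encoding: str) -> list:
    """Rename every entry in `directory` whose name is not valid UTF-8.

    Hypothetical helper: assumes all such names use `source_encoding`.
    Returns a list of (old_name, new_name) byte-string pairs.
    """
    renamed = []
    for raw_name in os.listdir(directory):  # bytes path in, bytes names out
        try:
            raw_name.decode("utf-8")
            continue  # already decodes as UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new_name = to_utf8_name(raw_name, source_encoding)
        os.rename(os.path.join(directory, raw_name),
                  os.path.join(directory, new_name))
        renamed.append((raw_name, new_name))
    return renamed
```

For example, `to_utf8_name(b"caf\xe9.txt", "latin-1")` returns `b"caf\xc3\xa9.txt"`. The sketch does not check whether the new name already exists, so it is worth trying on a copy of the directory first.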

If you don't know the character encoding, then there is no way in the general case to do the conversion without losing data: "non-utf8" is not specific enough. For example, if a filename contains the byte b'\xae', it can be interpreted differently depending on the filename encoding: it is u'®' in cp1252, but the same byte represents u'«' in cp437. There are modules such as chardet that let you guess the character encoding, but it is only a guess: "There Ain't No Such Thing as Plain Text."
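The ambiguity described above can be seen with the standard library alone. The following is a small sketch, not a detection method: `decode_candidates` is a hypothetical helper that tries a fixed list of encodings (chosen here only for illustration) and reports what each one makes of the same bytes.

```python
def decode_candidates(raw_name: bytes,
                      encodings=("utf-8", "cp1252", "cp437")) -> dict:
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw_name.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None  # these bytes are invalid in this encoding
    return results


if __name__ == "__main__":
    for encoding, text in decode_candidates(b"\xae").items():
        print(encoding, repr(text))
```

For `b"\xae"`, UTF-8 decoding fails while cp1252 and cp437 both succeed with different characters, so nothing in the bytes themselves says which reading is right. A guesser such as chardet can rank candidates, but a person or some outside knowledge about where the files came from still has to confirm the choice.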
