Windows-1252 to UTF-8 encoding

Question

I've copied some files from a Windows machine to a Linux machine. All the Windows-encoded (windows-1252) files need to be converted to UTF-8, but the files that are already in UTF-8 should not be changed. I'm planning to use the "recode" utility for that. How can I tell "recode" to convert only the windows-1252 encoded files and leave the UTF-8 files alone?

Example usage of recode:
recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know whether myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.
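The corruption the question worries about can be illustrated with a minimal Python sketch (Python is used here only for illustration; the question itself uses recode). The function name `simulate_wrong_conversion` is hypothetical. It assumes the file is already UTF-8 and shows what a windows-1252 to UTF-8 conversion would then write.

```python
def simulate_wrong_conversion(text: str) -> bytes:
    """Show the bytes produced when UTF-8 data is wrongly converted as Windows-1252."""
    utf8_bytes = text.encode("utf-8")        # what an already-UTF-8 file contains
    misread = utf8_bytes.decode("cp1252")    # the converter misreads it as Windows-1252
    return misread.encode("utf-8")           # ...and writes the misreading back as UTF-8


# "caf\u00e9" is "cafe" with an acute accent on the final letter.
print(simulate_wrong_conversion("caf\u00e9"))  # b'caf\xc3\x83\xc2\xa9'
```

The accented letter, which was two bytes in the original UTF-8 file, becomes four bytes that display as two unrelated characters. This is the usual "double encoding" damage.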

Recommended answer

How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only a handful of byte values are unassigned).
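To make that concrete, here is a small Python check (Python is chosen for illustration and is not something the answer uses). It rests on one assumption: that Python's strict `cp1252` codec is a fair stand-in for the Windows-1252 table. Under that codec all but five byte values decode, so "it decodes as Windows-1252" tells you almost nothing about a file. The function name is hypothetical.

```python
def undefined_cp1252_bytes() -> list:
    """Return the byte values that Python's strict cp1252 codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


print([hex(v) for v in undefined_cp1252_bytes()])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Other decoders may be more lenient and map these five values to control characters, so this is a property of the codec used here rather than a universal rule.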

Now there are certainly characteristics which would strongly suggest that it's UTF-8 (if it starts with the UTF-8 BOM, for example), but they wouldn't be definitive.
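A BOM check is a one-liner in most languages. The sketch below uses Python's standard `codecs.BOM_UTF8` constant; the function name `starts_with_utf8_bom` is hypothetical.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the three-byte UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(starts_with_utf8_bom(b"\xef\xbb\xbfhello"))  # True
print(starts_with_utf8_bom(b"hello"))              # False
```

The result is only a hint in both directions: the same three bytes are also legal Windows-1252 text (an unlikely but possible way to start a file), and many UTF-8 files carry no BOM at all.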

One option would be to first detect whether it's actually a completely valid UTF-8 file, I suppose. Again, that would only be suggestive.
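A validity check of this kind might look like the following Python sketch, which simply attempts a strict UTF-8 decode. The function name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a completely valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict by default: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("caf\u00e9".encode("utf-8")))   # True
print(is_valid_utf8("caf\u00e9".encode("cp1252")))  # False
print(is_valid_utf8(b"plain ascii"))                # True
```

Note the last case: a pure-ASCII file is valid UTF-8 and valid Windows-1252 at the same time, and its bytes are identical in both, so leaving it untouched is harmless.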

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. At that point you could detect whether a file is valid UTF-8 by recoding it from UTF-8 to UTF-8 and seeing whether the input and output are identical.
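Whether recode behaves this way is not confirmed here. As an illustration of the round-trip idea only, the Python sketch below uses Python's own decoder with `errors="replace"`, which substitutes U+FFFD for each invalid sequence. The function name is hypothetical.

```python
def survives_utf8_round_trip(data: bytes) -> bool:
    """Decode as UTF-8 with replacement, re-encode, and compare with the input."""
    # Invalid sequences become U+FFFD, which re-encodes as EF BF BD,
    # so the output differs from the input whenever anything was invalid.
    decoded = data.decode("utf-8", errors="replace")
    return decoded.encode("utf-8") == data


print(survives_utf8_round_trip(b"caf\xc3\xa9"))  # True: valid UTF-8
print(survives_utf8_round_trip(b"caf\xe9"))      # False: Windows-1252 bytes
```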

Alternatively, do this programmatically rather than using the recode utility. It would be quite straightforward in C#, for example.
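As a rough sketch of the programmatic route, here is one way it could look in Python rather than the C# the answer mentions. The function name `convert_if_not_utf8` is hypothetical, and the code bakes in a loud assumption: any file that is not valid UTF-8 is taken to be Windows-1252. It also overwrites files in place, so working on copies first would be prudent.

```python
from pathlib import Path


def convert_if_not_utf8(path: Path) -> bool:
    """Rewrite `path` as UTF-8 unless it is already valid UTF-8.

    Returns True if the file was converted, False if it was left untouched.
    """
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8 (or pure ASCII): leave it alone
    except UnicodeDecodeError:
        pass

    # Assumption: everything that is not valid UTF-8 is Windows-1252.
    # Python's strict cp1252 codec raises here on the few unassigned bytes.
    text = data.decode("cp1252")

    # Write bytes directly so line endings are preserved exactly.
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling `convert_if_not_utf8(Path("myfile.txt"))` a second time on the same file returns False, because the first call has already made it valid UTF-8.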

Just to reiterate, though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.
