Windows-1252 to UTF-8 encoding
Question
I've copied certain files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. Files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I tell recode to convert only the windows-1252 encoded files and not the UTF-8 files?
Example usage of recode:
recode windows-1252.. myfile.txt
This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to confirm that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.
Recommended answer
How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only five byte values, 0x81, 0x8D, 0x8F, 0x90 and 0x9D, are unassigned).
Now there are certainly characteristics which would strongly suggest that it's UTF-8 (if it starts with the UTF-8 BOM, for example), but they wouldn't be definitive.
One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
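A minimal sketch of that validity check, written in Python rather than with recode. The function name `is_valid_utf8` is made up for illustration. A strict decode fails on the first malformed byte sequence. Pure-ASCII input passes as well, which is harmless here because ASCII bytes are identical in both encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict by default: raises on malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

A True result only says the bytes are well-formed UTF-8, not that the author intended UTF-8.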
I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. At that point you could detect whether a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
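The same round-trip idea can be sketched in Python, without assuming anything about how recode treats invalid input. Here Python's `errors="replace"` handler substitutes U+FFFD for each malformed sequence (rather than a question mark), so any invalid input comes back changed. The helper name `survives_utf8_round_trip` is hypothetical.

```python
def survives_utf8_round_trip(data: bytes) -> bool:
    """Return True if decoding and re-encoding as UTF-8 gives back the same bytes."""
    # errors="replace" swaps each malformed sequence for U+FFFD instead of raising
    decoded = data.decode("utf-8", errors="replace")
    return decoded.encode("utf-8") == data
```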
Alternatively, do this programmatically rather than using the recode utility. It would be quite straightforward in C#, for example.
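As a rough illustration of the programmatic route, here is a sketch in Python (swapped in for C#). It assumes that any file which fails a strict UTF-8 decode is Windows-1252, which is exactly the heuristic discussed above and not a guarantee. The function name `convert_to_utf8_if_needed` and its `source_encoding` parameter are assumptions made for this example.

```python
from pathlib import Path


def convert_to_utf8_if_needed(path: str, source_encoding: str = "cp1252") -> bool:
    """Rewrite the file as UTF-8 unless it already decodes as UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    file = Path(path)
    raw = file.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (or plain ASCII): leave it alone
    except UnicodeDecodeError:
        pass
    # Strict decode: raises UnicodeDecodeError on the byte values that
    # Python's cp1252 codec leaves unassigned, so the file is not touched.
    text = raw.decode(source_encoding)
    file.write_bytes(text.encode("utf-8"))  # bytes I/O keeps line endings as-is
    return True
```

Calling `convert_to_utf8_if_needed("myfile.txt")` on each copied file would then skip the ones that are already UTF-8. Working on a backup copy first seems prudent, since the file is rewritten in place.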
Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.