Windows-1252 to UTF-8 encoding


Problem description


I've copied some files from a Windows machine to a Linux machine, so all the Windows-encoded (Windows-1252) files need to be converted to UTF-8. Files that are already in UTF-8 should not be changed. I'm planning to use the recode utility for this. How can I tell recode to convert only the Windows-1252 files and leave the UTF-8 files alone?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

Solution

How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).
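To make the risk concrete, here is a small Python sketch (Python is only my choice for illustration; the question itself uses recode). It assumes Python's `cp1252` codec as a stand-in for Windows-1252, and shows that bytes which are already UTF-8 decode as Windows-1252 without any error, so converting them a second time silently corrupts the text.

```python
# Bytes that are already UTF-8: "café", with é encoded as C3 A9.
utf8_bytes = "café".encode("utf-8")

# Decoding them as Windows-1252 raises no error, because C3 and A9
# are both assigned characters (Ã and ©) in that code page.
misread = utf8_bytes.decode("cp1252")

# Re-encoding the misread text as UTF-8 yields the double-encoded
# result that the question is worried about.
double_encoded = misread.encode("utf-8")

print(misread)          # cafÃ©
print(double_encoded)   # b'caf\xc3\x83\xc2\xa9'
```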

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
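A BOM check is easy to sketch in Python. The helper name `starts_with_utf8_bom` is hypothetical; `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF. The last line shows why the check is not definitive: the same three bytes are also legal Windows-1252 text.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(starts_with_utf8_bom(b"\xef\xbb\xbfhello"))  # True
print(starts_with_utf8_bom(b"hello"))              # False

# The BOM bytes also decode as Windows-1252, as the three characters ï»¿
print(codecs.BOM_UTF8.decode("cp1252"))
```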

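One option would be to first check whether the file is completely valid UTF-8, I suppose. Again, that would only be suggestive, although strongly so: Windows-1252 text containing non-ASCII characters very rarely happens to form valid UTF-8 byte sequences.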
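A minimal sketch of such a validity check in Python, assuming the whole file fits in memory. The function name `is_valid_utf8` is hypothetical; it relies on `bytes.decode` being strict by default. Note that pure ASCII counts as valid UTF-8, which is fine here because ASCII files need no conversion.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("café".encode("utf-8")))   # True
print(is_valid_utf8("café".encode("cp1252")))  # False
print(is_valid_utf8(b"plain ASCII"))           # True
```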

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
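I can't say how recode behaves here, but the round-trip idea itself can be sketched in Python as an analogue. With `errors="replace"`, invalid sequences are turned into U+FFFD, so the re-encoded output only matches the input when the input was valid UTF-8 to begin with. The function name is hypothetical.

```python
def survives_utf8_round_trip(data: bytes) -> bool:
    """Return True if decoding and re-encoding as UTF-8 reproduces `data` exactly."""
    # Invalid sequences become U+FFFD, which re-encodes to EF BF BD,
    # so the output differs from the input whenever something was invalid.
    return data.decode("utf-8", errors="replace").encode("utf-8") == data


print(survives_utf8_round_trip("café".encode("utf-8")))   # True
print(survives_utf8_round_trip("café".encode("cp1252")))  # False
```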

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
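As a rough sketch of the programmatic route, here it is in Python rather than C#. It rests on one loud assumption: any file that is not valid UTF-8 is taken to be Windows-1252. The function name `convert_to_utf8_if_needed` is hypothetical, and the file is rewritten in place, so try it on copies first.

```python
from pathlib import Path


def convert_to_utf8_if_needed(path: Path) -> bool:
    """Rewrite `path` as UTF-8 unless it is already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        pass

    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    # Python's cp1252 codec rejects the five unassigned bytes
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a file containing them raises
    # UnicodeDecodeError here instead of being converted.
    text = data.decode("cp1252")

    # Write bytes directly so that line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling it twice on the same file is harmless: the second call sees valid UTF-8 and returns False without touching the file.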

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you what it is with 100% accuracy.
