Windows-1252 to UTF-8 encoding


Question

I've copied some files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. Files that are already in UTF-8 should not be changed. I'm planning to use the recode utility for this. How can I make recode convert only the windows-1252 encoded files and leave the UTF-8 files alone?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

Recommended answer

How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as it maps nearly every possible byte to a character (only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).
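As a small check of that claim, the sketch below counts which single bytes Python's built-in cp1252 codec refuses to decode. Python is used here purely for illustration (the answer itself does not use it), the function name is hypothetical, and other decoders may behave differently: some map the unassigned positions to control characters instead of rejecting them.

```python
def undecodable_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            bad.append(value)
    return bad


if __name__ == "__main__":
    # Print the rejected byte values in hex.
    print([hex(b) for b in undecodable_cp1252_bytes()])
```

With CPython's codec only five positions are rejected, so a file that happens to avoid those bytes decodes as Windows-1252 without error, whatever its real encoding is.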

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
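A BOM check is easy to sketch. The helper below is a hypothetical illustration in Python, not part of recode; it only tests for the three bytes EF BB BF at the start of the data.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """True if the data begins with the three-byte UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)
```

The result is only a hint: the same three bytes are also legal Windows-1252, where they decode to the characters "ï»¿", and most UTF-8 files on Linux carry no BOM at all.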

One option would be to first detect whether it's actually a completely valid UTF-8 file, I suppose... again, that would only be suggestive.
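One way to sketch that validity check is a strict decode, shown here in Python with a hypothetical helper name. It assumes the whole file fits in memory as a bytes object.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if every byte sequence in data is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict by default: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII passes this test too, which is harmless here because ASCII bytes mean the same thing in both encodings. Windows-1252 text containing accented letters usually fails, since a lone byte such as 0xE9 is not a complete UTF-8 sequence.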

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. At that point you could detect whether a file is valid UTF-8 by recoding it from UTF-8 to UTF-8 and seeing whether the input and output are identical.
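The same round-trip idea can be sketched without recode. The Python helper below (a hypothetical name, and not a statement about how recode behaves) decodes with replacement of invalid sequences and compares the re-encoded bytes with the input.

```python
def survives_utf8_round_trip(data: bytes) -> bool:
    """True if decoding and re-encoding as UTF-8 gives back the same bytes."""
    # Invalid sequences become U+FFFD, so the re-encoded bytes differ from the input.
    round_tripped = data.decode("utf-8", errors="replace").encode("utf-8")
    return round_tripped == data
```

Because the re-encoded output is always valid UTF-8, it can only equal the input when the input was valid to begin with.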

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
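A minimal sketch of the programmatic route, written in Python rather than C#. The function name is hypothetical, and the key assumption is stated in the code: any file that is not valid UTF-8 is treated as Windows-1252. That is a guess, not a detection.

```python
from pathlib import Path


def convert_to_utf8_if_needed(path: Path) -> bool:
    """Rewrite the file as UTF-8 unless it is already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    Assumption: anything that is not valid UTF-8 is Windows-1252.
    """
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        pass
    # Raises UnicodeDecodeError for the few bytes cp1252 leaves unassigned;
    # in that case nothing is written and the file stays untouched.
    text = data.decode("cp1252")
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling it a second time on the same file returns False, so it is safe to run over a directory more than once. Since the file is overwritten in place, working on a backup copy is the cautious choice.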

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you with 100% accuracy.
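A two-byte example, sketched in Python, shows why no check can be conclusive: the same bytes are valid under both encodings and simply mean different things.

```python
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")      # one character: e with acute accent
as_cp1252 = data.decode("cp1252")   # two characters: A with tilde, copyright sign

print(repr(as_utf8), repr(as_cp1252))
```

Neither decode raises an error, so only knowledge of where the file came from can settle which reading was intended.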
