How can I detect the encoding/codepage of a text file


Problem description

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor works for UTF-8 and other Unicode files that start with a byte order mark, but I'm looking for a way to detect code pages such as ibm850 and windows1252.
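
To make clear why byte order mark detection only helps with Unicode files, here is a minimal Python sketch of the idea (not the .NET StreamReader implementation). The function name `sniff_bom` and the table `BOMS` are made up for this illustration; the BOM constants come from Python's standard `codecs` module. A file written in ibm850 or windows1252 has no such marker, so the function has nothing to go on.

```python
import codecs

# Known byte order marks. UTF-32 comes first because the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # legacy codepages carry no marker, so nothing can be inferred
```
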


Thanks for your answers, this is what I've done.

The files we receive come from end users who do not have a clue about codepages. The receivers are also end users; by now, this is what they know about codepages: codepages exist, and they are annoying.

Solution:

  • Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something, you can guess the intended text with your human intelligence.
  • I've created a small app that lets the user open the file and enter a piece of text that they know should appear in the file when the correct codepage is used.
  • Loop through all codepages, and display the ones that produce the user-provided text.
  • If more than one codepage pops up, ask the user to specify more text.
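
The steps above could be sketched roughly as follows. This is a Python illustration of the idea, not the original app: the function name `matching_codepages` and the short `CANDIDATES` list are assumptions made for the example, whereas the approach described above loops over every codepage available on the system.

```python
# A small, hand-picked candidate list (an assumption for illustration);
# the approach described above tries every available codepage instead.
CANDIDATES = ("utf_8", "cp437", "cp850", "cp1252", "latin_1", "cp1250", "cp1251")


def matching_codepages(data: bytes, known_text: str, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes to text
    that contains `known_text`."""
    matches = []
    for name in candidates:
        try:
            decoded = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        if known_text in decoded:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Simulate a file written by a program that used codepage 850.
    file_bytes = "Name: François".encode("cp850")
    print(matching_codepages(file_bytes, "François"))
```

Several related codepages can match the same short text because they encode the characters in it identically, which is why the last step asks the user for more text when more than one candidate remains.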

Answer

You can't detect the codepage; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
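
A short Python illustration of why the bytes alone cannot settle the question: a single-byte codepage will often decode bytes written in another one without raising any error, so a "successful" decode proves nothing. The variable names are made up for this example.

```python
# The bytes a program using codepage 850 would write for this name.
raw = "François".encode("cp850")

as_cp850 = raw.decode("cp850")    # the text the writer intended
as_cp1252 = raw.decode("cp1252")  # decodes without error, but to different text

print(as_cp850)
print(as_cp1252)

# latin_1 assigns a character to every possible byte value, so decoding
# with it can never fail, whatever the file really contains.
every_byte = bytes(range(256))
assert len(every_byte.decode("latin_1")) == 256
```
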

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically, Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
