How to determine the codepage of a file (that had some codepage transformation applied to it)


Question




For example, if I know that Ä‡ should be ć, how can I find out the codepage transformation that occurred there?

It would be nice if there were an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).
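A minimal sketch of one such reversal, assuming the damage was the very common single-step mix-up of UTF-8 bytes being decoded as cp1252:

```python
# Produce the classic mojibake and then reverse it.
# Assumption: the wrong codepage was cp1252; other 8-bit codepages
# would need the same round-trip with a different name.

original = "ć"                  # U+0107, LATIN SMALL LETTER C WITH ACUTE
raw = original.encode("utf-8")  # b'\xc4\x87'

# Misreading those UTF-8 bytes as cp1252 produces the familiar garbage:
garbled = raw.decode("cp1252")  # 'Ä‡'

# Reversing the mistake: re-encode with the wrong codepage to recover
# the original bytes, then decode them correctly as UTF-8 (roughly what
# `iconv -f utf-8 -t cp1252` piped back through would do).
repaired = garbled.encode("cp1252").decode("utf-8")
print(garbled, "->", repaired)  # Ä‡ -> ć
```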

EDIT:

Could you please be a little more verbose? Do you know for certain exactly what some substring should be? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. is the result valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?

Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.

What (I presume) happened in the problem I am trying to fix is described in this answer - a UTF-8 text file got opened as an ASCII text file and then exported as CSV.

Solution

It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (iso-8859-*, DOS and Windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you which codepage the text is in.

There is one encoding that is easy to tell apart: if the data is valid UTF-8, then it's almost certainly not iso-8859-* nor any Windows codepage, because while all byte values are valid in those encodings, the chance of a valid UTF-8 multi-byte sequence appearing in text encoded in them is almost zero.
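The validity test above can be sketched in a few lines. One caveat the sketch has to handle: pure ASCII decodes cleanly as UTF-8 too, but is equally valid in every 8-bit codepage, so it provides no evidence either way; the function below (a hypothetical helper, not from any library) therefore requires at least one multi-byte sequence:

```python
# Heuristic: bytes that decode cleanly as UTF-8 *and* contain at least
# one multi-byte sequence are very unlikely to be an 8-bit codepage.

def looks_like_utf8(data: bytes) -> bool:
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # A multi-byte sequence makes the decoded string shorter than the
    # byte string; pure ASCII keeps them equal and proves nothing.
    return len(text) < len(data)

print(looks_like_utf8("naïve".encode("utf-8")))   # True
print(looks_like_utf8("naïve".encode("cp1252")))  # False: lone 0xEF is
                                                  # not a valid sequence
print(looks_like_utf8(b"plain ascii"))            # False: no evidence
```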

Beyond that, it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.

If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obviously wrong ones and uses a spell-checker to pick the most likely. I don't know of any tool that would do it.
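A hedged sketch of such a script, assuming the original text was UTF-8 and the damage was a single wrong decode through one of a short list of 8-bit codepages. The round-trip either raises (candidate eliminated) or yields a repair candidate; ranking the survivors with a spell-checker is left out, since that needs an external dictionary:

```python
# Try a small set of plausible "wrong" codepages, keep only the ones
# that round-trip back to valid UTF-8, and print the surviving repairs.
# CANDIDATES and candidate_fixes are illustrative names, not a real API.

CANDIDATES = ["cp1250", "cp1252", "iso-8859-2", "iso-8859-1"]

def candidate_fixes(garbled: str):
    fixes = []
    for wrong in CANDIDATES:
        try:
            raw = garbled.encode(wrong)   # undo the suspected wrong decode
            fixed = raw.decode("utf-8")   # assume the original was UTF-8
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue                      # obviously wrong, eliminate it
        if fixed != garbled:              # identical text proves nothing
            fixes.append((wrong, fixed))
    return fixes

for enc, text in candidate_fixes("Ä‡"):
    print(f"decoded as {enc}: {text!r}")
```

For "Ä‡" this keeps cp1250 and cp1252 (both map the garbage back to ć, illustrating why a spell-checker is still needed to break ties) and eliminates the iso-8859-* candidates, which cannot even encode the ‡ character.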
