How to convert "binary text" to "visible text"?

Question

I have a text file full of non-ASCII characters. I cannot detect its encoding with either file or enca.

file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text

enca non_ascii.txt
Unrecognized encoding

But I can open it normally in Notepad++ on Windows.

The statement above was misleading, sorry. In fact, I selected some parts of the original file, put them into a new text file, and then opened that file in Notepad++.

The two parts are shown below. Notepad++ decodes them in two different ways.

Questions:

  1. How can I detect a file's encoding under Linux?
  2. How do I recover the characters represented by <F1><EE><E9><E4><FF>? Running grep 'сойдя' win.txt returns nothing, even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.

A slice of the file's content is shown below:

less non_ascii.txt
"non_ascii.txt" may be a binary file.  See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>

Recommended answer

Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?

The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
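The hint that UTF-8 provides can be checked mechanically: arbitrary 8-bit text rarely happens to be valid UTF-8, so a strict decode is a cheap first test. The following is a minimal Python sketch; the helper name `looks_like_utf8` is made up for illustration, and the sample bytes are the first line of the excerpt in the question.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# First line of the excerpt from the question: <F1><EE><E9><E4><FF>
sample = bytes([0xF1, 0xEE, 0xE9, 0xE4, 0xFF])

print(looks_like_utf8(sample))                    # False: not valid UTF-8
print(looks_like_utf8("сойдя".encode("utf-8")))  # True
```

A negative result only rules UTF-8 out; it says nothing about which 8-bit encoding the file actually uses.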

A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
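As a rough illustration of what such a mapping looks like, the sketch below prints each byte of the first excerpt line under a few Cyrillic code pages known to Python. The choice of candidate codecs and the helper name `byte_table` are assumptions made for this example; they are not taken from the linked table.

```python
# Candidate 8-bit Cyrillic encodings (an illustrative selection).
CODECS = ["cp1251", "koi8_r", "iso8859_5", "cp866"]


def byte_table(data: bytes, codecs: list[str]) -> list[tuple[int, list[str]]]:
    """For each byte, list the character it maps to in every codec."""
    rows = []
    for b in data:
        chars = [bytes([b]).decode(c, errors="replace") for c in codecs]
        rows.append((b, chars))
    return rows


sample = bytes([0xF1, 0xEE, 0xE9, 0xE4, 0xFF])
table = byte_table(sample, CODECS)

print("byte  " + "  ".join(CODECS))
for b, chars in table:
    print(f"0x{b:02X}  " + "  ".join(ch.ljust(len(c)) for ch, c in zip(chars, CODECS)))
```

Reading the cp1251 column top to bottom gives "сойдя", the word the question expects; the other columns do not.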

With that out of the way, converting is easy.

iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
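For readers who would rather script the conversion, here is a small Python sketch of the same step. The function name `convert_to_utf8` is hypothetical, and the demo writes its own temporary file containing the first excerpt line instead of touching a real non_ascii.txt.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, src_encoding: str = "cp1251") -> None:
    """Decode src with the given encoding and write it back out as UTF-8."""
    text = src.read_bytes().decode(src_encoding)  # strict: raises on unmapped bytes
    dst.write_bytes(text.encode("utf-8"))


# Demo on a temporary copy of the first excerpt line plus a newline.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "non_ascii.txt"
    dst = Path(tmp) / "utf8.txt"
    src.write_bytes(bytes([0xF1, 0xEE, 0xE9, 0xE4, 0xFF, 0x0A]))
    convert_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")

print(converted, end="")
```

Decoding strictly is a deliberate choice here: a byte with no mapping in the assumed code page raises an error, which is a useful sign that the encoding guess was wrong.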

Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should now work on utf8.txt.
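The reason the original grep found nothing can be seen by comparing byte sequences: a UTF-8 terminal passes the search word to grep as UTF-8 bytes, which never occur in a CP1251 file. A short Python sketch, assuming the file holds the first excerpt line:

```python
word = "сойдя"

utf8_bytes = word.encode("utf-8")      # what a UTF-8 terminal sends to grep
cp1251_bytes = word.encode("cp1251")   # what the file actually contains

print(utf8_bytes.hex(" "))    # d1 81 d0 be d0 b9 d0 b4 d1 8f
print(cp1251_bytes.hex(" "))  # f1 ee e9 e4 ff

# Stand-in for the file contents: the first excerpt line plus a newline.
file_bytes = bytes([0xF1, 0xEE, 0xE9, 0xE4, 0xFF, 0x0A])

print(utf8_bytes in file_bytes)    # False: the pattern grep searched for
print(cp1251_bytes in file_bytes)  # True: the pattern that is really there
```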

The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.

Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
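To make the "they have to guess" point concrete, here is a deliberately naive guesser in Python. It is not how chardet works; it simply assumes the text is Russian and picks the candidate encoding whose decoding contains the most letters from a hand-picked set of common lowercase Russian letters. The letter set, the candidate list, and the function names are all assumptions made for this sketch.

```python
# Assumption: the text is Russian, so these lowercase letters should be frequent.
COMMON_RUSSIAN = set("оеаинтсрвл")
CANDIDATES = ["cp1251", "koi8_r", "iso8859_5", "cp866"]


def score(data: bytes, encoding: str) -> int:
    """Count decoded characters that fall in the common-letter set."""
    text = data.decode(encoding, errors="replace")
    return sum(ch in COMMON_RUSSIAN for ch in text)


def guess_encoding(data: bytes) -> str:
    """Return the candidate with the highest score (first one wins ties)."""
    return max(CANDIDATES, key=lambda enc: score(data, enc))


# First two lines of the excerpt from the question.
sample = bytes([
    0xF1, 0xEE, 0xE9, 0xE4, 0xFF, 0x0A,
    0xF2, 0xF0, 0xE0, 0xEA, 0xF2, 0xEE, 0xE2, 0xE0, 0xF2, 0xFC, 0xF1, 0xFF, 0x0A,
])

for enc in CANDIDATES:
    print(enc, score(sample, enc))
print("guess:", guess_encoding(sample))
```

On these two lines the heuristic settles on cp1251, but with a shorter input or a different language assumption it could easily pick wrong, which is exactly the limitation described above.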
