Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

Question

Regarding a solution for determining whether a file is binary or text in Python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

and then uses .translate(None, textchars) to remove (or replace by nothing) all such characters in a file read in as binary.
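As a minimal sketch of how the two pieces fit together (the function name `is_binary_string` and the sample inputs are assumptions for illustration, not part of the question), the check could look like this in Python 3:

```python
# Byte values the heuristic treats as "text": seven control characters
# (BEL, BS, TAB, LF, FF, CR, ESC) plus everything from 0x20 through 0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte outside textchars."""
    # bytes.translate(None, delete) removes every byte listed in `delete`.
    # Anything left over is a byte the heuristic considers non-text.
    return bool(data.translate(None, textchars))


looks_like_text = is_binary_string(b"hello, world\r\n")    # no bytes survive
looks_like_binary = is_binary_string(b"\x00\x01payload")   # NUL and SOH survive
```

In practice one would probably pass only the first block of a file opened in `'rb'` mode rather than its whole contents.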

The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what's text and what's not). What is so significant about these numbers in distinguishing text files from binary files?

Answer

They represent the most-common codepoints for printable text, plus newlines, spaces and carriage returns and the like. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 bytes for accented characters, etc.

You'd expect text to use only those codepoints. Binary data can use any value in the range 0x00-0xFF; for example, a text file will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).

It is a heuristic at best, though. Some text files may still try and use C0 control codes outside those 7 characters explicitly named, and I'm sure binary data exists that happens to not include the 25 byte values not included in the textchars string.
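To make the "25 byte values" figure and the two failure modes concrete, here is a small sketch. The sample byte strings are made up for illustration, and the variable names are assumptions:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Every byte value the heuristic treats as non-text:
# the C0 control codes 0x00-0x1F minus the seven listed exceptions.
excluded = sorted(set(range(0x100)) - set(textchars))
excluded_count = len(excluded)

# False positive: text that uses a vertical tab (0x0B) is flagged as binary.
text_with_vt = b"column one\x0bcolumn two\n"
flagged_as_binary = bool(text_with_vt.translate(None, textchars))

# False negative: arbitrary data that happens to avoid the excluded bytes
# passes as text.
high_bytes_only = bytes(range(0x80, 0x100))
passes_as_text = not high_bytes_only.translate(None, textchars)
```

Here `excluded_count` comes out to 25, matching the number quoted above.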

The author of the range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those codepoints were chosen:

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly enough, that table excludes 0x7F, which the code you found does not.
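The difference is easy to check: 0x7F (DEL) lies inside range(0x20, 0x100), so the snippet from the question accepts it as text. If you wanted to treat DEL as non-text as well, one possible adjustment is sketched below; `strict_textchars` is a hypothetical name, not something taken from file(1) or from the original answer:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# DEL (0x7F) is covered by range(0x20, 0x100), so it counts as text here.
del_is_text = 0x7F in textchars

# Hypothetical stricter variant: the same set with DEL removed.
strict_textchars = bytearray(b for b in textchars if b != 0x7F)

# With the stricter set, a lone DEL byte survives translate() and is
# therefore flagged as non-text.
del_flagged = bool(b"\x7f".translate(None, strict_textchars))
```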
