How do I distinguish between 'binary' and 'text' files?

Question

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc.) and 'text' files (source code, XML files, HTML files, email, etc.).

In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. Even so, it is still useful to talk about 'binary' and 'text' files; to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.

However, there are various tools that work on a wide range of files, and in practical terms you want to do something different depending on whether the file is 'text' or 'binary'. One example is any tool that outputs data on the console: plain 'text' looks fine and is useful, whereas 'binary' data messes up your terminal and is generally not useful to look at. GNU grep, at least, uses this distinction when deciding whether to output matches to the console.

So the question is: how do you tell whether a file is 'text' or 'binary'? To restrict it further, how do you tell on a Linux-like file system? I am not aware of any file system metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell whether it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters that are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site, and I should have specified it, though I suppose pointers to existing code that does this are helpful in general.) I'm not really asking which existing programs I can use to do this.
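To make the restriction above concrete, here is a rough Python sketch of one possible reading of "printable on the user's console". It treats a sample as 'text' when it contains no NUL bytes and nearly all of its bytes are printable ASCII or common whitespace. The function names, the 2048-byte sample size, and the 0.95 threshold are assumptions chosen for illustration. The sketch deliberately ignores non-ASCII encodings, so UTF-8 text with many accented characters could be misclassified.

```python
import string

# Bytes treated as 'text': printable ASCII plus whitespace such as tab and newline.
# This set is an assumption for illustration; it ignores non-ASCII encodings.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def looks_like_text(sample: bytes, threshold: float = 0.95) -> bool:
    """Guess whether a byte sample is printable ASCII text."""
    if not sample:
        return True  # treating an empty file as text is a judgment call
    if b"\x00" in sample:
        return False  # NUL bytes rarely appear in plain ASCII text
    printable = sum(byte in TEXT_BYTES for byte in sample)
    return printable / len(sample) >= threshold


def file_looks_like_text(path: str, sample_size: int = 2048) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size))
```

A threshold below 1.0 tolerates the occasional stray control byte. Sampling only the start of the file keeps the check cheap, but it can miss binary data that appears later.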

Recommended answer

The spreadsheet software my company makes reads a number of binary file formats as well as text files.

We first look at the first few bytes for a magic number that we recognize. If we do not recognize the magic number of any of the binary types we read, we look at up to the first 2K bytes of the file to see whether it appears to be UTF-8, UTF-16, or text encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
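The sequence described above might be sketched in Python roughly as follows. This is not the answerer's actual code. The magic-number table, the `classify` and `_decodes_as` names, and the returned labels are hypothetical. As a simplifying assumption, UTF-16 is only accepted when a byte-order mark is present, and decoded text containing NUL characters is rejected, because single-byte code pages accept almost any byte sequence.

```python
import codecs
import locale

# Hypothetical magic-number table; a real application would list the
# signatures of the binary formats it actually reads.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
}

SAMPLE_SIZE = 2048  # "up to the first 2K bytes"


def _decodes_as(sample: bytes, encoding: str) -> bool:
    """True if the sample decodes cleanly and contains no NUL characters."""
    try:
        decoder = codecs.getincrementaldecoder(encoding)()
        # final=False tolerates a character cut off at the sample boundary.
        text = decoder.decode(sample, final=False)
    except (LookupError, UnicodeDecodeError):
        return False
    return "\x00" not in text


def classify(sample: bytes) -> str:
    """Return a format label for the sample, or raise ValueError."""
    sample = sample[:SAMPLE_SIZE]
    # Step 1: recognized binary formats, by magic number.
    for magic, name in MAGIC_NUMBERS.items():
        if sample.startswith(magic):
            return name
    # Step 2: text encodings. UTF-16 is only tried when a BOM is present
    # (an assumption of this sketch, not something the answer states).
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidates = [("utf-16", "text/utf-16")]
    else:
        candidates = [
            ("utf-8", "text/utf-8"),
            (locale.getpreferredencoding(False), "text/locale"),
        ]
    for encoding, label in candidates:
        if _decodes_as(sample, encoding):
            return label
    # Step 3: nothing matched, so report a file we cannot handle.
    raise ValueError("unrecognized file format")
```

A caller would typically read the sample with `open(path, "rb").read(SAMPLE_SIZE)` and pass it to `classify`. Checking magic numbers first means a known binary format is never mistaken for text, which matches the order given in the answer.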
