Why should I use a human readable file format?

Question

Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case?

I did have this as an explanation when initially posting the question, but it's not so relevant now:

When answering this question I wanted to refer the asker to a standard SO answer on why using a human readable file format is a good idea. Then I searched for one and couldn't find one. So here's the question.

Recommended answer

It depends

The right answer is: it depends. If you are writing audio/video data, for instance, and you crowbar it into a human readable format, it won't be very readable! Word documents are the classic example where people have wished they were human readable, and so more flexible, and by moving to XML Microsoft is going that way.

Much more important than binary versus text is standard versus non-standard. If you use a standard format, then chances are you and the next guy won't have to write a parser, and that's a win for everyone.

What follows are some opinionated reasons why you might want to choose one over the other, if you have to write your own format (and parser).

  1. The next guy. Consider the maintenance developer looking at your code 30 years or six months from now. Yes, he should have the source code. Yes, he should have the documents and the comments. But he quite likely won't. Having been that guy, and having had to rescue or convert old, extremely valuable data, I'll thank you for making it something I can just look at and understand.
  2. Let me read AND WRITE it with my own tools. If I'm an emacs user I can use that. Or Vim, or notepad or ... Even if you've created great tools or libraries, they might not run on my platform, or might not run at all any more. Also, I can then create new data with my tools.
  3. The tax isn't that big - storage is free. Disc space is nearly always free, and if it isn't, you'll know. Don't worry about a few angle brackets or commas; usually they won't make that much difference. Premature optimisation is the root of all evil. And if you are really worried, just use a standard compression tool, and then you have a small human readable format - anyone can run unzip.
  4. The tax isn't that big - computers are quick. It might be faster to parse binary. Until you need to add an extra column, or data type, or support both legacy and new files. (Though this is mitigated with Protocol Buffers.)
  5. There are a lot of good formats out there. Even if you don't like XML. Try CSV. Or JSON. Or .properties. Or even XML. Lots of tools already exist for parsing these in lots of languages. And it only takes 5 minutes to write them again if all the source code mysteriously gets lost.
  6. Diffs become easy. When you check in to version control it is much easier to see what has changed. And to view it on the Web. Or your iPhone. With binary, you know something has changed, but you rely on the commit comments to tell you what.
  7. Merges become easy. You still get questions on the web asking how to append one PDF to another. This doesn't happen with text.
  8. Easier to repair if corrupted. Try repairing a corrupt text document vs. a corrupt zip archive. Enough said.
  9. Every language (and platform) can read or write it. Of course, binary is the native language for computers, so every language will support binary too. But a lot of the classic little tool scripting languages work a lot better with text data. I can't think of a language that works well with binary and not with text (assembler maybe), whereas the reverse is common. And that means your programs can interact with other programs you haven't even thought of, or that were written 30 years before yours. There are reasons Unix was successful.
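Points 3 and 6 above can be sketched in a few lines of Python. This is only an illustration under assumed data: the `records` list and the `old`/`new` config dictionaries are made up, and JSON plus gzip stand in for "a human readable format" and "a standard compression tool".

```python
import difflib
import gzip
import json

# Hypothetical sample data; any repetitive text format behaves similarly.
records = [{"id": i, "name": f"sensor-{i}", "value": i * 0.5} for i in range(1000)]

# Point 3: store readable text, then let a standard tool shrink it.
text = json.dumps(records, indent=1, sort_keys=True)
raw = text.encode("utf-8")
packed = gzip.compress(raw)
print(f"plain: {len(raw)} bytes, gzipped: {len(packed)} bytes")

# Decompressing gives back exactly the readable text.
restored = gzip.decompress(packed).decode("utf-8")

# Point 6: a line-based diff shows what changed between two versions.
old = json.dumps({"name": "demo", "retries": 3}, indent=1, sort_keys=True)
new = json.dumps({"name": "demo", "retries": 5}, indent=1, sort_keys=True)
diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))
print("\n".join(diff))
```

Writing one key per line (here via `indent` and `sort_keys`) is a design choice that keeps diffs small; a single-line dump of the same data would diff as one changed line.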

Why not, and use binary instead?

  1. You might have a lot of data - terabytes maybe. And then a factor of 2 could really matter. But premature optimisation is still the root of all evil. How about using a human readable one now, and converting later? It won't take much time.
  2. Storage might be free but bandwidth isn't (Jon Skeet in comments). If you are throwing files around the network then size can really make a difference. Even bandwidth to and from disc can be a limiting factor.
  3. Really performance intensive code. Binary can be seriously optimised. There is a reason databases don't normally have their own plain text format.
  4. A binary format might be the standard. So use PNG, MP3 or MPEG. It makes the next guy's job easier (for at least the next 10 years).
  5. There are lots of good binary formats out there. Some are global standards for that type of data. Or might be a standard for hardware devices. Some are standard serialization frameworks. A great example is Google Protocol Buffers. Another example: Bencode.
  6. Easier to embed binary. Some data already is binary and you need to embed it. This works naturally in binary file formats, but looks ugly and is very inefficient in human readable ones, and usually stops them being human readable.
  7. Deliberate obscurity. Sometimes you don't want it obvious what your data is doing. Encryption is better than accidental security through obscurity, but if you are encrypting you might as well make it binary and be done with it.
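To put a number on point 6, here is a rough Python sketch comparing the two ways of embedding a binary blob. The 3000-byte random `blob` stands in for an image or audio fragment, and the length-prefixed layout is a made-up format for illustration, not any particular standard.

```python
import base64
import json
import os
import struct

# Stand-in for binary data that has to live inside the file.
blob = os.urandom(3000)

# Text route: JSON cannot hold raw bytes, so they are base64-encoded,
# which turns every 3 bytes into 4 characters of unreadable text.
encoded = base64.b64encode(blob).decode("ascii")
doc = json.dumps({"name": "thumbnail", "data": encoded})

# Binary route (hypothetical layout): a 4-byte big-endian length
# prefix followed by the raw bytes.
framed = struct.pack(">I", len(blob)) + blob

# Both forms round-trip to the original bytes.
from_text = base64.b64decode(json.loads(doc)["data"])
(length,) = struct.unpack(">I", framed[:4])
from_binary = framed[4:4 + length]

print(f"raw: {len(blob)}, base64: {len(encoded)}, framed: {len(framed)}")
```

The base64 form is about a third larger than the raw bytes before any JSON overhead, and the encoded field is no more readable to a human than the binary was.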

Debatable

  1. Easier to parse. People have claimed that both text and binary are easier to parse. Clearly the easiest format to parse is one your language or library already supports, and this is true for some binary and some human readable formats, so it doesn't really favour either. Binary formats can be chosen so they are easy to parse, but so can human readable ones (think CSV or fixed width), so I think this point is moot. Some binary formats can just be dumped into memory and used as is, so these could be said to be the easiest to parse, especially if numbers (not just strings) are involved. However, I think most people would argue that human readable parsing is easier to debug, as it is (slightly) easier to see what is going on in the debugger.
  2. Easier to control. Yes, it is more likely someone will mangle text data in their editor, or will moan when one Unicode format works and another doesn't. With binary data that is less likely. However, people and hardware can still mangle binary data. And you can (and should) specify a text encoding for human readable data, either flexible or fixed.
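As a rough illustration of why the parsing point is moot, the sketch below stores the same rows as CSV with an explicitly stated UTF-8 encoding and as fixed-width binary records. The `rows` data and the 12-byte record layout (`<id`: little-endian int32 plus float64) are assumptions made up for this example.

```python
import csv
import io
import struct

# Hypothetical data: (id, value) pairs.
rows = [(1, 2.5), (2, 3.75), (3, -1.0)]

# Text: CSV, with the encoding fixed to UTF-8 rather than left to the editor.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text_bytes = buf.getvalue().encode("utf-8")

reader = csv.reader(io.StringIO(text_bytes.decode("utf-8"), newline=""))
parsed_text = [(int(a), float(b)) for a, b in reader]

# Binary: fixed-size records, no padding because of the "<" prefix.
RECORD = struct.Struct("<id")
binary_bytes = b"".join(RECORD.pack(i, v) for i, v in rows)

parsed_binary = [
    RECORD.unpack_from(binary_bytes, offset)
    for offset in range(0, len(binary_bytes), RECORD.size)
]

print(text_bytes)    # readable in any editor or debugger
print(binary_bytes)  # compact, but opaque without the record layout
```

Both parsers are a handful of lines when the language already supports the format; the practical difference shows up when someone has to inspect the bytes without the code to hand.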

At the end of the day, I don't think either can really claim an advantage here.

Are you sure you really want a file? Have you considered a database? :-)

Credits

A lot of this answer merges together things other people wrote in other answers (you can see them there). Especially big thanks to Jon Skeet for his comments (both here and offline) suggesting ways it could be improved.
