Create file header (metadata of file) in C


Question

A file header contains all the data about a file, that is, its metadata. I want to create a blank file with metadata, then add the content of other files to this blank file, which means I also need to change (modify) the metadata. Is there a library in C for creating a file header? How do I read and write a file header in C?

metadata = {
    file_name;
    file_size;
    file_type;
    file_name_size;
    total_files;
}

Recommended answer

There are probably a number of libraries that handle specific file formats, such as the variations on tar, but not one that will be adapted to your particular header format.

You will need to decide, first, whether your metadata is of a fixed or variable size.

If it is a fixed size, it is relatively easy to skip over that many bytes at the start, write the rest of the file, and then rewind and fill in the metadata. If the only variable size parts are known at the start, you can handle it much the same way - write the first version then go back when you're done and write the final version.
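The reserve-then-rewind approach can be sketched roughly as follows. This is only an illustration: the 16-byte header size, the "DEMO" tag, and the function name `write_with_header` are all assumptions made for the example, not part of any existing library.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HDR_SIZE 16  /* hypothetical fixed header size */

/* Writes a placeholder header, then the body, then seeks back and
   fills in the real header. Returns 0 on success, -1 on failure. */
int write_with_header(const char *path, const void *body, uint32_t body_len)
{
    unsigned char hdr[HDR_SIZE] = {0};
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* 1. Reserve the header area with placeholder zeros. */
    if (fwrite(hdr, 1, HDR_SIZE, fp) != HDR_SIZE)
        goto fail;

    /* 2. Write the rest of the file. */
    if (body_len > 0 && fwrite(body, 1, body_len, fp) != body_len)
        goto fail;

    /* 3. Build the real header now that the final values are known. */
    memcpy(hdr, "DEMO", 4);                    /* hypothetical magic tag */
    hdr[4] = (unsigned char)(body_len >> 24);  /* body length, MSB first */
    hdr[5] = (unsigned char)(body_len >> 16);
    hdr[6] = (unsigned char)(body_len >> 8);
    hdr[7] = (unsigned char)body_len;

    /* 4. Seek back to the start and overwrite the placeholder. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        goto fail;
    if (fwrite(hdr, 1, HDR_SIZE, fp) != HDR_SIZE)
        goto fail;
    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

Calling `write_with_header("out.bin", "hello", 5)` would produce a 21-byte file: 16 header bytes followed by the 5 body bytes.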

If you won't know the size of the variable material until the end, you are in some difficulty. You probably end up writing a temporary file with the bulk of the file, then when you're done and know all the variable size metadata, you write the metadata header to a new (the final) file, then copy the temporary file after the metadata.
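A minimal sketch of that final step, assuming the bulk data has already been written to a temporary stream (for example one obtained from `tmpfile()`). The names `copy_stream` and `finish_file` are made up for this illustration.

```c
#include <stdio.h>

/* Copies everything from the current position of `in` to `out`. */
static int copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Writes the now-complete header to the final file, then appends
   the body that was accumulated in the temporary file. */
int finish_file(FILE *tmp, const void *hdr, size_t hdr_len, FILE *out)
{
    if (fwrite(hdr, 1, hdr_len, out) != hdr_len)
        return -1;
    rewind(tmp);  /* go back to the start of the temporary body */
    return copy_stream(tmp, out);
}
```

The cost of this design is that the body is written twice, which is usually acceptable unless the files are very large.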

Note that you should place the size (length) of the file name before the actual file name in the data on disk. Then you can read how big the name is and allocate the right space and read the correct amount of data. Placing the length of the file name after the file name itself really doesn't help very much.
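As a sketch of the length-before-name layout, the following pair of functions stores a 2-byte length (MSB first) followed by the name bytes. The 2-byte width and the names `write_name` and `read_name` are assumptions chosen for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes a 2-byte big-endian length, then the name (no terminator). */
int write_name(FILE *fp, const char *name)
{
    size_t len = strlen(name);
    unsigned char pre[2];

    if (len > 0xFFFF)
        return -1;  /* does not fit in the 2-byte length field */
    pre[0] = (unsigned char)(len >> 8);
    pre[1] = (unsigned char)(len & 0xFF);
    if (fwrite(pre, 1, 2, fp) != 2)
        return -1;
    if (len > 0 && fwrite(name, 1, len, fp) != len)
        return -1;
    return 0;
}

/* Reads the length first, so exactly the right amount of space can be
   allocated before reading the name. The caller frees the result. */
char *read_name(FILE *fp)
{
    unsigned char pre[2];
    size_t len;
    char *name;

    if (fread(pre, 1, 2, fp) != 2)
        return NULL;
    len = ((size_t)pre[0] << 8) | pre[1];
    name = malloc(len + 1);
    if (name == NULL)
        return NULL;
    if (len > 0 && fread(name, 1, len, fp) != len) {
        free(name);
        return NULL;
    }
    name[len] = '\0';
    return name;
}
```

Because the length comes first, the reader never has to scan for a terminator or guess at a buffer size.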

You also need to think about whether your header will be binary data or text. The file name component will be text, but the numbers could be 2-byte or 4-byte binary values, or variable-length ASCII (plain text) equivalents. Text representations are usually easier to debug, but you are more likely to want variable-length data if you use text. However, you can always use a fixed size with blank padding too. One other advantage of text over binary is that text is portable across machine architectures, whereas binary brings up questions of big-endian vs little-endian machines, and so on and so forth.

You should also consider using a 'magic number' to allow you to identify that the file contains the right sort of data. The 'number' might be an ASCII string, like the !<arch>\n used in some versions of ar headers, for example. Or the %PDF-1.3\n used at the start of a PDF file. Having said that, tar largely gets away without a magic number in the first bytes, but that is an unusual design these days. The file program knows a lot about magic numbers. Its data is sometimes found in a file - such as the files under /usr/share/file for Mac OS X.
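Checking a magic number when opening a file might look something like this. The helper name `has_magic` and the 16-byte limit are assumptions for the sketch; the `!<arch>\n` string is the ar example mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* Reads the first magic_len bytes from the stream and compares them.
   Returns 1 on a match, 0 on a mismatch, -1 if the bytes could not
   be read or magic_len is out of range. */
int has_magic(FILE *fp, const char *magic, size_t magic_len)
{
    char buf[16];

    if (magic_len == 0 || magic_len > sizeof buf)
        return -1;
    if (fread(buf, 1, magic_len, fp) != magic_len)
        return -1;
    return memcmp(buf, magic, magic_len) == 0;
}
```

For instance, `has_magic(fp, "!<arch>\n", 8)` on a freshly opened stream would report whether the file begins with the ar signature.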

Can you please explain with an example?

One file format I deal with is for messages identified by a 32-bit (signed) number, with variable lengths for the messages and therefore variable offsets. The file is written in a platform-neutral but binary format. The numbers are written big-endian, with the MSB first. The message numbers are currently constrained to the range ±99,999 (so there is room for just under 200,000 messages in the system as a whole).

The file header contains:

  • 2-byte (unsigned) magic number
  • 2-byte (unsigned) count of the number of messages contained in the file, N

It is followed by N entries, each of which describes a message:

  • 4-byte (signed) message number
  • 2-byte (unsigned) message length
  • 4-byte (unsigned) offset of the start of the message

The N entries are in sorted order of message number, but there is no requirement that the message numbers be contiguous. Missing numbers are simply missing.

After the N entries, the actual message texts follow, each consisting of the appropriate number of bytes identified by the corresponding entry plus an ASCII NUL '\0' byte.

As the file is generated, the text of each message is written out to an intermediate file in the order processed, recording the offset of the message in the file. It doesn't matter whether the messages are read or written in order; all that matters is that the offset from the end of the header is recorded in a header record. Once all the messages have been read in, the in-memory copy of the file entries can be sorted into numeric order, and the final file can be written. First there is the magic number and the number of messages; then N entries describing the messages; followed by the text of the messages copied from the intermediate file.
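The header-writing stage could be sketched as below: sort the in-memory entries, then emit the magic number, the count, and the N entries, all MSB first. The `struct msg_entry` layout, the `MSG_MAGIC` value, and the function names are assumptions for illustration; the description above does not give the actual magic value. Copying the message texts from the intermediate file would follow as a separate step.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_MAGIC 0x4D47u  /* hypothetical magic value */

struct msg_entry {
    int32_t  number;  /* message number */
    uint16_t length;  /* length of the message text */
    uint32_t offset;  /* offset of the text from the end of the header */
};

static int cmp_entry(const void *a, const void *b)
{
    const struct msg_entry *x = a, *y = b;
    return (x->number > y->number) - (x->number < y->number);
}

static void put_be16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xFF);
}

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Sorts the entries by message number, then writes the 4-byte file
   header followed by n 10-byte entries. Returns 0 on success. */
int write_msg_header(FILE *out, struct msg_entry *e, uint16_t n)
{
    unsigned char buf[10];
    uint16_t i;

    if (n > 0)
        qsort(e, n, sizeof *e, cmp_entry);

    put_be16(buf, MSG_MAGIC);
    put_be16(buf + 2, n);
    if (fwrite(buf, 1, 4, out) != 4)
        return -1;

    for (i = 0; i < n; i++) {
        put_be32(buf, (uint32_t)e[i].number);  /* two's complement bits */
        put_be16(buf + 4, e[i].length);
        put_be32(buf + 6, e[i].offset);
        if (fwrite(buf, 1, 10, out) != 10)
            return -1;
    }
    return 0;
}
```

Writing each field byte by byte, rather than dumping the struct with a single `fwrite`, avoids depending on the compiler's padding and the machine's byte order.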

Reading a message number M is simple enough. You do a binary search through the N entries to find the entry for M. If it isn't there, so be it - that's an error. If it is there, you know where to find it in the file and how long it is.
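Assuming the N entries have already been read into a sorted in-memory array, the lookup can lean on the standard `bsearch` function. The `struct msg_entry` type and the `find_msg` name are assumptions made for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

struct msg_entry {
    int32_t  number;  /* message number */
    uint16_t length;  /* length of the message text */
    uint32_t offset;  /* where the text starts */
};

/* Compares a message number (the key) with a table entry. */
static int cmp_key(const void *key, const void *elem)
{
    int32_t k = *(const int32_t *)key;
    int32_t n = ((const struct msg_entry *)elem)->number;
    return (k > n) - (k < n);
}

/* Returns the entry for message m, or NULL if it is not in the table.
   The table must be sorted by ascending message number. */
const struct msg_entry *find_msg(const struct msg_entry *tab, size_t n,
                                 int32_t m)
{
    if (n == 0)
        return NULL;
    return bsearch(&m, tab, n, sizeof *tab, cmp_key);
}
```

A NULL result corresponds to the "it isn't there" error case; otherwise the entry's offset and length say where to seek and how much to read.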

The fact that the data is in a fixed but binary format doesn't really complicate things. You use the same functions on both big-endian and little-endian machines to read the number into native format. In theory, you could optimize for a big-endian machine, but only if the machine doesn't have problems with insufficiently aligned data. It is simpler to forget that the optimization might be possible and simply use the same code everywhere.
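Byte-order-independent readers of the kind described here can be written with shifts alone, so the same code works on any host. The function names below are assumptions for the example; they decode from a byte buffer that has already been read from the file.

```c
#include <stdint.h>

/* Decodes a 2-byte unsigned value stored MSB first. */
uint16_t get_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decodes a 4-byte unsigned value stored MSB first. */
uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decodes a 4-byte two's complement signed value stored MSB first,
   without relying on an implementation-defined cast. */
int32_t get_be32s(const unsigned char *p)
{
    uint32_t u = get_be32(p);

    if (u <= 0x7FFFFFFFu)
        return (int32_t)u;
    return (int32_t)(u - 0x80000000u) - 0x7FFFFFFF - 1;
}
```

Because the bytes are combined one at a time, the buffer does not need any particular alignment, which sidesteps the alignment concern mentioned above.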

If the format described above was converted to a text format, then it would probably have 8 bytes (say) reserved for the magic number (which might well be a 7-letter string followed by a newline), and 6 bytes reserved for the number of messages (5 digits plus a newline). Each of the message entries could be reserved 6 bytes for the message number (±99,999 for the number), plus a space, plus 4 bytes for the length (maximum, 8KiB) plus a space, plus an offset in 8 bytes (7 digits plus a newline).

MAGICNO
12345
-99999 8000 0000000
-90210   38 0008000
...
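Entry lines in that fixed-width layout can be produced with `snprintf`. This is a sketch under the field widths described above (6 + 1 + 4 + 1 + 7 characters plus a newline, 20 bytes in total); the name `format_entry` and the range checks are assumptions added for the example.

```c
#include <stdio.h>

/* Formats one fixed-width entry line into buf.
   Returns 20 (the line length) on success, or -1 if a value does not
   fit its field or the buffer is too small. */
int format_entry(char *buf, size_t size, long number,
                 unsigned length, unsigned long offset)
{
    int n;

    if (number < -99999L || number > 99999L)
        return -1;  /* needs more than 6 characters */
    if (length > 9999u || offset > 9999999UL)
        return -1;  /* needs more than 4 or 7 digits */

    n = snprintf(buf, size, "%6ld %4u %07lu\n", number, length, offset);
    return (n == 20 && (size_t)n < size) ? n : -1;
}
```

With a 32-byte buffer, `format_entry(buf, sizeof buf, -90210, 38, 8000)` yields the line `-90210   38 0008000` followed by a newline, matching the second entry in the sample above.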

Again, the advantage of the text file for readability is that you can look at the file and see the meaning of the data quite easily.

You can have endless variations on this theme.
