Reading text from a binary file like PDF

Question

I have a problem reading a binary file in C++. Currently my code looks like this:

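// Fragment: needs <cstdio>, <cstdlib>, <iostream> and `using namespace std`;
// `source` is assumed to hold the path of the file to read.
FILE *s=fopen(source, "rb");
if(s==NULL){
    fputs("File error", stderr);
    exit(1);
}
fseek(s,0,SEEK_END);
size_t size=ftell(s);   // file length in bytes
rewind(s);

char *sbuffer=(char *) malloc(sizeof(char) * size);
if(sbuffer==NULL){
    fputs("Memory error", stderr);
    exit(2);
}
size_t result=fread(sbuffer,1,size,s);
if(result != size){
    fputs("Reading error",stderr);
    exit(3);
}
fclose(s);
// The buffer holds raw bytes, not a NUL-terminated string,
// so write exactly `size` bytes.
cout.write(sbuffer, size);
cout<<endl;
free(sbuffer);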

However, the characters printed on the terminal look random instead of being the text I wrote in the PDF file. They look like this:

% P D F - 1 . 3 
 % ? ? ? ? ? ? ? ? ? ? ? 
 4   0   o b j 
 < <   / L e n g t h   5   0   R   / F i l t e r   / F l a t e D e c o d e   > > 
 s t r e a m 
 x  ? ? ? j ? 0  E ? ? ? k ?  y Q E # ? ? ? m ? & ? ? @  % + ? .     ? ?  ? ? A i  ?     4 z \ 1 G W ? ?  - , ? ? ? (  ? ? ?  9 ? ? ? ? ?  \ ? } ? ? ? e ? ? ? ? 0 ? ? ? ~ ? , ? ? & 8 ? ? x e 4 ? r 
 | ? ? ? 
          ? ? ? ? E  > a ? ? z & ? Z ? < ?  }  '  ? ? ? j p ? ? Q 7 0 ? ? ? S %  - p ? ? ? 7 D  ?  ? ? ' Q z Q ?  ? ? ? ? ? ? ? ? ? \ 2 ? ? 7 ? ? ? < ? ? D ~  ? ? ? 

 e n d s t r e a m 
 e n d o b j 
 5   0   o b j 
 2 2 8 
 e n d o b j 
 2   0   o b j

And there are many more characters like the above. I searched for a long time but could not find out how to get the actual characters out for later processing. By the way, I'm trying to write a compressor that takes a binary file as input and produces one as output. Any help is highly appreciated!

Recommended answer

Only a few file formats, such as plain raw .TXT text files, can be "read" and "understood" directly. Most file formats, including almost every binary format, are exactly that: a format. This implies a certain structure held inside the file, in complete contrast to a .TXT text file, which is structure-less, or rather, one huge block of pure data.

Open WordPad, Word, or any other at least somewhat intelligent text editor, write some text, and save it as RTF, DOC, ODT, or any other non-TXT file. Then save it as a TXT file too.

Download a hex viewer/hex editor. Any one will do. Take one of the free ones; you don't need many features, just one that displays raw binary values in one column and ASCII text in the other. Almost any free hex viewer/editor can do that.
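
If you would rather see what such a tool does than download one, here is a minimal sketch of the same two-column view in C++. The names hex_dump and read_file are made up for this illustration, and the row layout (offset, 16 hex values, ASCII column with '.' for non-printable bytes) is just one common convention, not a standard.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format bytes as hex-viewer rows: offset, up to 16 hex values, ASCII column.
// Bytes outside the printable ASCII range are shown as '.'.
std::string hex_dump(const std::vector<unsigned char>& data) {
    std::string out;
    char buf[32];
    for (std::size_t row = 0; row < data.size(); row += 16) {
        std::snprintf(buf, sizeof buf, "%08zx  ", row);
        out += buf;
        std::string ascii;
        for (std::size_t i = row; i < row + 16; ++i) {
            if (i < data.size()) {
                std::snprintf(buf, sizeof buf, "%02x ", data[i]);
                out += buf;
                const bool printable = data[i] >= 0x20 && data[i] < 0x7f;
                ascii += printable ? static_cast<char>(data[i]) : '.';
            } else {
                out += "   ";  // pad a short final row so the columns line up
            }
        }
        out += " |" + ascii + "|\n";
    }
    return out;
}

// Read a whole file as raw bytes (binary mode, no text translation).
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_file(const char* path) {
    std::vector<unsigned char> data;
    if (std::FILE* f = std::fopen(path, "rb")) {
        unsigned char chunk[4096];
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
            data.insert(data.end(), chunk, chunk + n);
        std::fclose(f);
    }
    return data;
}
```

Something like std::fputs(hex_dump(read_file("sample.pdf")).c_str(), stdout); would print the view, where sample.pdf is a placeholder path. A file starting with %PDF shows up as 25 50 44 46 in the hex column.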

Open and compare those two files. You will immediately see the difference.

Back to PDF:

A PDF can even contain graphics interleaved with the text. How would you expect that to work if the text were "just sitting in the file" as in a TXT? How would the image position, description, and data be embedded? A PDF can even contain executable scripts, if I remember correctly, similar to JavaScript. In a PDF-type document you can have buttons that do something. That's much more complicated than just text in a file.

Binary files usually do not contain plain text that is readable by eye. They have the text structured in blocks, wrapped in metadata about colors, text layout, paging and such, or even special structures about document versioning, authoring, classification, (...). All of this has to be stored somewhere.
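
The output in the question shows this block structure directly: readable keywords such as "stream" and "endstream" surround unreadable data, and the dictionary in front says /Filter /FlateDecode, meaning the bytes in between are deflate-compressed. As a rough sketch, the function below locates those blocks. It is a naive keyword scan written for illustration (find_streams and StreamSpan are made-up names); a real PDF reader would follow the /Length entry rather than search for "endstream", and would still have to decompress the data to get at any text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte range [begin, end) of the raw data inside one stream block.
struct StreamSpan {
    std::size_t begin;
    std::size_t end;
};

// Naive scan: pair each "stream" keyword with the next "endstream".
// Assumption: the keywords do not occur inside the stream data itself.
std::vector<StreamSpan> find_streams(const std::string& pdf) {
    const std::string open = "stream";
    const std::string close = "endstream";
    std::vector<StreamSpan> spans;
    std::size_t pos = 0;
    while ((pos = pdf.find(open, pos)) != std::string::npos) {
        // Skip a "stream" that is really the tail of an "endstream".
        if (pos >= 3 && pdf.compare(pos - 3, 3, "end") == 0) {
            pos += open.size();
            continue;
        }
        std::size_t begin = pos + open.size();
        // The keyword is followed by an end-of-line marker before the data.
        if (begin < pdf.size() && pdf[begin] == '\r') ++begin;
        if (begin < pdf.size() && pdf[begin] == '\n') ++begin;
        const std::size_t end = pdf.find(close, begin);
        if (end == std::string::npos) break;  // unterminated block
        spans.push_back({begin, end});
        pos = end + close.size();
    }
    return spans;
}
```

Load the file into a std::string in binary mode before calling it; each returned span is the still-compressed payload of one block.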

Usually, binary files have sections. The first section is usually called the HEADER. Inside, there will be information about the format type, format version, file/block/data length, image resolution, and similar. All of this will most probably be kept in binary form: no "800x600" text, just "|00|00|03|20|00|00|02|58|", assuming 32-bit big-endian (BE) integers. After you have read, decoded, and understood the description, you will know where the actual data starts, how the data blocks are laid out, and how to decode them and understand what they contain.
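
To make the 800x600 example concrete, here is a small sketch that decodes those eight bytes. The header layout (a 32-bit big-endian width followed by a 32-bit big-endian height) is invented for this illustration and does not belong to any real format; ImageHeader, read_u32_be and parse_header are likewise made-up names.

```cpp
#include <cstddef>
#include <cstdint>

// Decode an unsigned 32-bit big-endian integer from four raw bytes.
std::uint32_t read_u32_be(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Hypothetical header layout, for illustration only: width, then height.
struct ImageHeader {
    std::uint32_t width;
    std::uint32_t height;
};

// Returns false if the buffer is too short to hold the 8-byte header.
bool parse_header(const unsigned char* buf, std::size_t len, ImageHeader& out) {
    if (len < 8) return false;
    out.width = read_u32_be(buf);
    out.height = read_u32_be(buf + 4);
    return true;
}
```

Feeding it the bytes 00 00 03 20 00 00 02 58 gives width 800 (0x320) and height 600 (0x258). Assembling the value byte by byte, instead of casting the buffer to an integer pointer, keeps the result independent of the machine's own byte order.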

EDIT:

After you understand the difference between text files and binary files, check out the absolute basics at http://en.wikipedia.org/wiki/Entropy_(information_theory). Then try playing with RLE (http://www.daniweb.com/software-development/cpp/code/216388/basic-rle-file-compression-routine) or Huffman (http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html) just to start on something relatively simple. Then read more about Huffman codes, and then, well, you will be reasonably prepared for the task, like ZIP or LZH.
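
As a starting point for the RLE experiment, here is a minimal byte-oriented sketch. It is not the routine from the linked page; the (count, value) pair encoding and the names Bytes, rle_encode and rle_decode are choices made for this illustration.

```cpp
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Encode the input as (count, value) pairs. A count fits in one byte,
// so runs longer than 255 are split into several pairs.
Bytes rle_encode(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

// Expand (count, value) pairs back into the original bytes.
// A trailing unpaired byte in malformed input is ignored.
Bytes rle_decode(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.insert(out.end(), static_cast<std::size_t>(in[i]), in[i + 1]);
    return out;
}
```

Because it works on raw bytes it accepts any file, text or binary. Input with no repeated bytes doubles in size under this scheme, which is what you should expect on data that is already compressed, such as the stream blocks inside a PDF.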
