Programmatically rip text from a PDF file (by hand) - missing some text

Question

Note: I am not interested in using a parsing library. This is for my own entertainment.

I've been experimenting with ripping text out of PDF files for a search gizmo, but am unable to extract text from some pdf files.

Note that this is a much easier problem than straight up parsing; I don't care if I inadvertently include some garbage in my output, nor do I really care if the formatting of the document is intact. I don't even care if the words come out in order.

As a first step, I created a very simple pdf parser using the strategy found on this project. Basically, all it does is search pdf files for zlib streams, inflate them, and pull out any text it finds in parentheses. This fails to parse data stuck inside of << >> blocks, but my understanding is that this is for hex-encoded blobs of data, which don't seem to be in the test file that I am failing to parse...or at least I don't see them.
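The strategy described above (inflate every zlib stream, grab parenthesised strings) can be sketched in a few lines of Python. This is a rough illustration, not the asker's actual code: it assumes FlateDecode streams and ignores escape sequences, other filters, and encodings entirely.

```python
import re
import zlib

def rip_text(pdf_path):
    """Naive ripper: inflate every zlib stream found in the file and pull
    out any parenthesised string literals. Sketch only -- no real parsing,
    no encoding handling, which is exactly why CID fonts defeat it."""
    data = open(pdf_path, "rb").read()
    out = []
    # Candidate streams begin right after the 'stream' keyword.
    for m in re.finditer(rb"stream\r?\n", data):
        try:
            inflated = zlib.decompressobj().decompress(data[m.end():])
        except zlib.error:
            continue  # not a FlateDecode stream (or a false match)
        # Grab literal strings: content between unescaped parentheses.
        for s in re.findall(rb"\(((?:[^()\\]|\\.)*)\)", inflated):
            out.append(s.decode("latin-1"))
    return out
```

Using `decompressobj` rather than `zlib.decompress` matters here: the slice handed to it runs to the end of the file, and `decompressobj` simply stops at the end of the deflate stream instead of raising.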

Similarly, iText.Net also fails, though PDFMiner and PDFBox succeed. However, the latter two projects have too many layers of indirection to be easily examined; I had trouble figuring out exactly what they were doing, in part because I don't really use either language enough to be accustomed to debugging it in any significant manner.

My goal is to create a text ripper that grabs text out of a pdf file with as little understanding of the pdf format itself as possible (e.g. my test parser grabs text out of parentheses, but has no idea whether the portion of the pdf it is examining is the header).

Answer

Extracting content out of a PDF file can get a little complex. I do this as my day job, and I think I can point you in the right direction.

What you are trying to do (extracting string between parentheses) works with simple WinAnsi or MacRoman encoding only, used with Type1 or TrueType fonts. Unfortunately these single-byte encodings do not support proper Unicode content. Your sample document uses Type0 aka CID fonts, where each character is identified by a glyph index. These are non-standard, ad-hoc encodings, where the designer of the font may assign a glyph index to any character in an arbitrary way. Sometimes the producer of the PDF intentionally mangles the encoding.

The way it works is that, starting with the catalog, you parse the page tree. Once you identify a page object, you parse its contents as well as its resources. The resources dictionary contains a list of fonts used by the page. Each CID font object contains a ToUnicode stream: a cmap (character map) that establishes the relationship between the glyph indexes and their Unicode values. For example:

<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>

This means the glyph 01 is Unicode U+0044, the glyph 02 is U+0061, and so on. You have to use this lookup table to translate glyph IDs back into Unicode.
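A lookup table like this can be built by scanning the cmap for hex pairs. The sketch below handles only bfchar-style entries like the four above; real cmaps also use bfrange entries and multi-byte ligature mappings, which it ignores.

```python
import re

def parse_bfchar(cmap_text):
    """Build a glyph-ID -> Unicode lookup from bfchar-style cmap entries.
    Each entry is a <src> <dst> hex pair; dst is one or more UTF-16BE
    code units. bfrange entries are NOT handled in this sketch."""
    table = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>",
                               cmap_text):
        gid = int(src, 16)
        table[gid] = bytes.fromhex(dst).decode("utf-16-be")
    return table

# The four entries quoted in the answer above:
cmap = """
<01> <0044>
<02> <0061>
<03> <0074>
<04> <0020>
"""
table = parse_bfchar(cmap)
```

With this table, the glyph sequence 01 02 03 02 decodes to "Data" (U+0044, U+0061, U+0074, U+0061).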

The page content itself has two important operators for you. The Tf is the font selector, which is important, because it identifies the font object. Each font has its own ToUnicode cmap, therefore depending on the font you must use a different lookup table.

The other interesting operator is the text show (typically TJ or Tj). With Type0 (CID) fonts the Tj doesn't contain human-readable text, but instead a sequence of glyph IDs that you are supposed to map into Unicode with the help of the above-mentioned cmap. Often the Tj uses a hex string, such as <000100a50056> Tj, instead of the more typical (Hello, World) Tj that you are familiar with. Either way, the string is not human-readable, and cannot be extracted without fully parsing the page, including all of its font resources, especially the ToUnicode cmap, which is itself a PostScript object, but you only care about the hex portions.
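Putting the pieces together, decoding hex-string Tj operands against a ToUnicode table might look like the sketch below. The glyph-to-character assignments in the example are invented for illustration; a real ripper would also handle TJ arrays (which interleave strings and kerning numbers) and track the active font via Tf, since each font has its own table.

```python
import re

def decode_hex_tj(content, to_unicode, code_bytes=2):
    """Decode every hex-string Tj in a (decompressed) content stream via a
    ToUnicode table. Sketch only: assumes one 2-byte CID font is active
    for the whole stream, and ignores TJ arrays and literal strings."""
    out = []
    for hexstr in re.findall(r"<([0-9A-Fa-f]+)>\s*Tj", content):
        raw = bytes.fromhex(hexstr)
        out.append("".join(
            # Unmapped glyphs become U+FFFD rather than being dropped.
            to_unicode.get(int.from_bytes(raw[i:i + code_bytes], "big"),
                           "\ufffd")
            for i in range(0, len(raw), code_bytes)))
    return out
```

For instance, if glyphs 0001, 00a5 and 0056 happened to map to "H", "e" and "y" (hypothetical values), the operand from the answer above would decode accordingly.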

Of course I have oversimplified the process, because there are dozens of different standard encodings, custom encodings (differential or ToUnicode), and we haven't even touched Arabic, Hindi, vertical Japanese fonts, Type3 fonts, etc. Sometimes the text cannot be extracted at all, because it's intentionally mangled.
