Issues extracting text from a PDF file using Java


Question

I am not able to extract the text from a PDF that uses custom-encoded fonts, which can be identified via File -> Properties -> Font in Adobe Reader. One of the fonts is listed as: C0EX02Q0_22, Type: Type 3, Encoding: Custom, Actual Font: C0EX02Q0_22, Actual Font Type: Type 3.

Is there any way to extract the text content from such PDF files? Currently I am using PDFText2HTML from pdf util. When extracting such PDF files I get values like 'ÁÙÅ@ÅÕãÉ'.

Sample PDF: tesis completa.pdf

In this PDF you can see that the fonts used have a custom encoding, e.g. T3Font_1 (see File -> Properties -> Font in Adobe Reader). Since I could not upload my own PDF, I have updated the question with a sample that has the same issue.

Recommended answer

Extraction as described in the standard

The PDF specification ISO 32000-1 describes in section 9.10, Extraction of Text Content, how text can be extracted if the PDF provides the required information and does so correctly.

Using this algorithm, though, only works in a few page ranges of the document (namely the summaries, the content lists, the thank-yous, and the section Publicación 7). In the other ranges it results in gibberish, e.g. 8QLYHUVLWDWGH/OHLGD instead of Universitat de Lleida. Looking at the PDF objects in question makes clear that the required information is missing: there is no ToUnicode map, and while the Encoding is based on WinAnsiEncoding, all positions in use are mapped via Differences to non-standard names.

Extracting the text using copy & paste from Adobe Reader returns the same gibberish. This is generally a sign that generic extraction is not possible.

Inspecting the PDF objects and the output of the generic text extraction attempt, though, suggests that the actual encoding of the text extracted as gibberish is the same for all fonts used, and that it is some ASCII-based encoding shifted by a constant: adding 'U' - '8' to each character of the extracted 8QLYHUVLWDWGH/OHLGD results in Universitat de Lleida. Adding the same constant to the characters of text extracted elsewhere in the document also results in correct text, as long as the text only uses ASCII characters.
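The shift can be checked with a few lines of Java. This is only a minimal sketch: the class name ShiftDecoder and the method name shift are made up for illustration, and the input is the already-extracted gibberish string, not the PDF itself. Note that the sample as quoted contains no separate characters for the spaces (a space minus the offset would be code 3, a control character, which may be why it is not visible), so the shifted result has no spaces either.

```java
final class ShiftDecoder {
    // Offset observed in the sample document: 'U' - '8' == 29.
    static final int SHIFT = 'U' - '8';

    /** Adds SHIFT to every character of the extracted (garbled) text. */
    static String shift(String garbled) {
        StringBuilder sb = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            sb.append((char) (garbled.charAt(i) + SHIFT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints "UniversitatdeLleida" (no spaces in the quoted sample).
        System.out.println(shift("8QLYHUVLWDWGH/OHLGD"));
    }
}
```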

Characters outside the ASCII range are not mapped correctly by that simple method, but they also seem to always be extracted as the same wrong character, e.g. the glyph 'ó' is always extracted as 'y'.

Thus, you can extract the text from that document (and from similarly created ones) by first extracting the text using the standard algorithm and then, in the gibberish sections (which can probably be identified by font name), replacing each character: add 'U' - '8' for small values, and replace according to some mapping for higher values.
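A rough sketch of that two-step replacement in Java follows. Everything named here is an assumption for illustration: the class GibberishDecoder, the overrides map (which you would have to fill in by inspecting the document), the cut-off for "small values" (taken here as "the shifted code still lands in printable ASCII"), and the test word, which is a constructed input rather than text taken from the PDF. Only the 'y' to 'ó' pair comes from the observation above.

```java
import java.util.Map;

final class GibberishDecoder {
    // Offset observed in the sample document: 'U' - '8' == 29.
    static final int SHIFT = 'U' - '8';
    // Assumption: a value counts as "small" if shifting keeps it in printable ASCII.
    static final int MAX_ASCII = 0x7E;

    /**
     * Decodes garbled text: explicit overrides win, otherwise the
     * character is shifted; anything else becomes U+FFFD.
     */
    static String decode(String garbled, Map<Character, Character> overrides) {
        StringBuilder sb = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char c = garbled.charAt(i);
            Character mapped = overrides.get(c);
            if (mapped != null) {
                sb.append(mapped.charValue());   // hand-built mapping for non-ASCII glyphs
            } else if (c + SHIFT <= MAX_ASCII) {
                sb.append((char) (c + SHIFT));   // plain ASCII shift
            } else {
                sb.append('\uFFFD');             // unknown: mark instead of guessing
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'y' -> 'ó' is the one pair mentioned above; extend the map as needed.
        Map<Character, Character> overrides = Map.of('y', '\u00F3');
        System.out.println(decode("3XEOLFDFLyQ", overrides));
    }
}
```

Applying the decoder only to text runs drawn with the affected fonts is left out here, since how to obtain the font name per run depends on the PDF library used.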

As you mentioned Java in your question, I have run your document through iText and PDFBox text extraction, with and without shifting by 'U' - '8', and the results look promising. I assume other general-purpose Java PDF libraries will also work.

Instead of creating custom extraction routines, you can try to fix the PDFs in question by adding ToUnicode map entries to the affected fonts. After that, normal text extraction programs should be able to extract the contents properly.
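To give an idea of what such a ToUnicode map contains, the following sketch builds the CMap text for a simple font with one-byte character codes, following the general layout of the ToUnicode example in ISO 32000-1. The class and method names are hypothetical, the code-to-Unicode pairs have to come from your own analysis of the document, and actually attaching the result as the ToUnicode stream of each affected font requires a PDF library and is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToUnicodeCMapSketch {
    /** Builds ToUnicode CMap text for one-byte codes (0x00..0xFF). */
    static String build(Map<Integer, Character> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n<00> <FF>\nendcodespacerange\n");

        // Sort by code so the output is deterministic.
        List<Map.Entry<Integer, Character>> entries =
                new ArrayList<>(new TreeMap<>(codeToUnicode).entrySet());
        // Emit bfchar sections of at most 100 entries each.
        for (int start = 0; start < entries.size(); start += 100) {
            int end = Math.min(start + 100, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, Character> e : entries.subList(start, end)) {
                int code = e.getKey();
                if (code < 0 || code > 0xFF) {
                    throw new IllegalArgumentException("not a one-byte code: " + code);
                }
                sb.append(String.format("<%02X> <%04X>\n",
                        code, (int) e.getValue().charValue()));
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n")
          .append("CMapName currentdict /CMap defineresource pop\n")
          .append("end\nend\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code 0x38 ('8' in the garbled output) should read as 'U'.
        System.out.print(build(Map.of(0x38, 'U')));
    }
}
```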
