A Java library for extracting text from PDF documents while preserving blank spaces and lines

Question

Do you know of a Java library that can extract the text of a PDF document as a string while preserving all the blank lines and whitespace of the original document, exactly as they appear in the PDF?

I am currently using the PDFTextStripper class from the PDFBox-0.7.3 library. Its getText() method does return the document as a string, but it also removes all empty lines, tabs, and any whitespace between pieces of text. Line breaks are preserved, so I can recognize the structure of the document, but I need to keep the other whitespace as well. This seems to be the default behaviour of getText(), and I could not find any method in the API that makes it preserve the whitespace.

Thanks for your help.

Recommended answer

Are you sure there are line feeds, tabs, and space characters in the document? Many of the PDFs I've encountered use positioning for spacing and indentation. Rather than including line feeds and tabs, the text object is simply placed further down the page and offset horizontally. In that case PDFBox isn't removing anything from the text; the spaces were never there.
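If the whitespace really is just positioning, one way to get it back is to rebuild it from the coordinates of each piece of text. The following is only a rough sketch of that idea and not something PDFBox provides: the `Fragment` class, the `render` method, and the fixed `charWidth` and `lineHeight` values are all hypothetical, and it assumes you have already obtained non-negative (x, y) positions for each fragment, measured from the top-left of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

final class LayoutSketch {
    /** Hypothetical positioned fragment: x from the left edge, y from the top edge, in points. */
    static final class Fragment {
        final double x, y;
        final String text;
        Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /**
     * Places each fragment on a character grid. charWidth and lineHeight are
     * assumed constants (a monospaced approximation); coordinates must be >= 0.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        TreeMap<Integer, StringBuilder> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            int row = (int) Math.round(f.y / lineHeight);
            int col = (int) Math.round(f.x / charWidth);
            StringBuilder line = rows.computeIfAbsent(row, r -> new StringBuilder());
            while (line.length() < col) {
                line.append(' ');                              // pad with spaces up to the column
            }
            line.replace(col, col + f.text.length(), f.text);  // overwrite or extend from col
        }
        // Emit every row from 0 to the last one, so gaps become empty lines.
        List<String> lines = new ArrayList<>();
        int last = rows.isEmpty() ? -1 : rows.lastKey();
        for (int r = 0; r <= last; r++) {
            StringBuilder line = rows.get(r);
            lines.add(line == null ? "" : line.toString());
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<Fragment> page = Arrays.asList(
                new Fragment(0, 0, "Name"),
                new Fragment(60, 0, "Value"),
                new Fragment(12, 24, "foo"));
        System.out.println(render(page, 6.0, 12.0));
    }
}
```

With proportional fonts a single character width is only an approximation, so columns may not line up exactly; the point is that the gaps come from the coordinates, not from characters stored in the file.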

If you haven't looked at the PDF source yet, that could be helpful. If it's compressed, you can use Multivalent Uncompress to make it readable. The PDF specification describes the text-positioning operators in section 9.4.2.
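To give a feel for what you would be looking for in the uncompressed source, here is a deliberately simplified sketch that walks a content-stream fragment and reports where each string is shown. It is an illustration under strong assumptions, not a PDF parser: the `TdSketch` class and `positions` method are made up for this example, it understands only the `BT`, `ET`, `Td` and `Tj` operators, and it ignores `Tm`, `TD`, `T*`, `TJ`, escaped or nested parentheses in strings, and the page transformation matrix.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TdSketch {
    // A literal string "(...)" without nested parentheses or escapes,
    // or any other run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\(([^()]*)\\)|(\\S+)");
    private static final String NUMBER = "[+-]?(\\d+\\.?\\d*|\\.\\d+)";

    /** Returns one "(x, y) text" entry per Tj operator, tracking Td offsets. */
    static List<String> positions(String content) {
        List<String> out = new ArrayList<>();
        Deque<String> operands = new ArrayDeque<>();
        double x = 0, y = 0;                       // start of the current line
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {              // string operand
                operands.push(m.group(1));
                continue;
            }
            String tok = m.group(2);
            switch (tok) {
                case "BT":                         // begin text object: reset position
                    x = 0; y = 0; operands.clear();
                    break;
                case "Td": {                       // move relative to the line start
                    double ty = Double.parseDouble(operands.pop());
                    double tx = Double.parseDouble(operands.pop());
                    x += tx; y += ty;
                    break;
                }
                case "Tj":                         // show a string at the current position
                    out.add(String.format(Locale.ROOT, "(%.0f, %.0f) %s", x, y, operands.pop()));
                    break;
                default:
                    if (tok.matches(NUMBER) || tok.startsWith("/")) {
                        operands.push(tok);        // number or name operand
                    } else {
                        operands.clear();          // operator this sketch does not handle
                    }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 72 700 Td (Name) Tj 120 0 Td (Value) Tj "
                + "-120 -24 Td (foo) Tj ET";
        positions(stream).forEach(System.out::println);
    }
}
```

In the sample stream, the gap between "Name" and "Value" and the extra vertical distance before "foo" exist only as `Td` offsets. No space or newline characters are stored, which is why a text extractor has nothing to preserve.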
