A Java library for text extraction from PDF documents that preserves empty spaces and lines

Views: 272 Published: 2020/5/25 5:05:33 java pdf text

Problem description

Do you know of a Java library with which I can extract the text of a PDF document as a string, and which also preserves all empty lines and empty spaces from the original document (as they appear in the PDF document)?

I am currently using the PDFTextStripper class from the PDFBox-0.7.3 library. Its getText() method does return the document as a string, but it also removes all empty lines, tabs, and any empty spaces between the text. Line breaks are preserved, so I can recognize the structure of the document, but it is important for me to keep the other whitespace as well. This is the default behaviour of getText(), and it does not seem possible to make it preserve the empty parts of the text (I could not find any method in the API for this purpose).
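To make the desired output concrete, here is a minimal sketch of what "preserving empty spaces and lines" could look like once text fragments and their positions are known. It is not PDFBox's API: the `LayoutSketch` class, the `Fragment` type, and the `render` method are hypothetical names invented for illustration, and the sketch assumes positions have already been mapped to character-cell rows and columns (a real PDF reports positions in points, and that mapping is not shown here).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LayoutSketch {

    /** Hypothetical positioned fragment: row and col are zero-based character cells. */
    public static final class Fragment {
        final int row;
        final int col;
        final String text;

        public Fragment(int row, int col, String text) {
            this.row = row;
            this.col = col;
            this.text = text;
        }
    }

    /**
     * Rebuilds a string from positioned fragments, emitting one newline per
     * skipped row and padding with spaces up to each fragment's column.
     */
    public static String render(List<Fragment> fragments) {
        // Sort top-to-bottom, then left-to-right, without mutating the input.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingInt(f -> f.row)
                .thenComparingInt(f -> f.col));

        StringBuilder out = new StringBuilder();
        int currentRow = 0;
        int currentCol = 0;
        for (Fragment f : sorted) {
            // Each missing row becomes a line break, so empty lines survive.
            while (currentRow < f.row) {
                out.append('\n');
                currentRow++;
                currentCol = 0;
            }
            // Horizontal gaps become spaces, so indentation and column gaps survive.
            while (currentCol < f.col) {
                out.append(' ');
                currentCol++;
            }
            out.append(f.text);
            currentCol += f.text.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(0, 0, "Name"));
        page.add(new Fragment(0, 10, "Value"));
        page.add(new Fragment(2, 4, "x")); // row 1 is intentionally empty
        System.out.println(render(page));
    }
}
```

Fragments that would overlap an earlier one on the same row are simply appended without padding; a fuller version would need a policy for that case.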

Thanks for your help.
