Extract PDF text by columns

Views: 106 | Published: 2020/5/25 4:51:43 | Tags: pdf, pdfbox


Problem description

My question is:

How can I extract text from a PDF file that is laid out in columns, so that the result is separated by those columns?

Background: I work on a project about text analysis (especially of scientific texts). These texts are sometimes published in multi-column layouts in which each column has its own page number. To order the extracted text by these printed page numbers, it would be useful to extract the text column by column.

I use pdfBox and have tried / searched for several things:

I tried the getThreadBeads() method of the PDPage class -> result: a list of size 0
I tried grabbing the text with the getCharactersByArticle() method -> the text is not divided into columns
(I tried this with PDF files of published texts as well as with self-created .doc-based files; each has a multi-column layout)

The thing is that pdfBox seems to divide the text into columns automatically: if I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are put into lines without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does make this division.
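The difference between the two settings can be reproduced with a toy model. The sketch below does not use pdfBox at all: `Fragment`, `inStreamOrder` and `sortedByPosition` are hypothetical names, and it assumes (without confirmation from the pdfBox source) that the PDF stores the left column's text before the right column's. Under that assumption, keeping the stored order preserves the columns, while sorting purely by position merges them line by line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model only: these are hypothetical types, not pdfBox classes. */
class ReadingOrderSketch {

    /** A piece of text with the position of its top-left corner (y grows downward). */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Emits the fragments in the order they were stored, one per line. */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    /** Sorts top-to-bottom, then left-to-right; fragments with the same y share a line. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : sorted) {
            if (previous != null) {
                // Exact comparison is a simplification that only suits this toy data.
                out.append(f.y == previous.y ? ' ' : '\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed storage order: left column first, then right column.
        List<Fragment> page = Arrays.asList(
                new Fragment(50, 100, "L1"), new Fragment(50, 120, "L2"),
                new Fragment(300, 100, "R1"), new Fragment(300, 120, "R2"));
        System.out.println(inStreamOrder(page));    // columns stay separate
        System.out.println("---");
        System.out.println(sortedByPosition(page)); // columns are interleaved per line
    }
}
```

If this model matches the real files, the column separation seen with setSortByPosition(false) would come from the order in which the producing application wrote the text, not from any column detection in the stripper.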

To understand this, I had a look at the pdfBox source code: the crucial method is the writePage() method of PDFTextStripper. This is evidently where spaces (which most PDFs do not store explicitly) and line breaks are calculated. But I couldn't find how the stripper calculates the column breaks.
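To make concrete what "calculating" spaces and line breaks from positions can mean, here is a minimal sketch. It is not the code of writePage(): `Glyph`, `rebuild`, `spaceWidth` and `lineTolerance` are hypothetical names, and the thresholds are assumptions chosen for illustration. The idea is simply that a horizontal gap wider than some fraction of a space becomes a space, and a vertical jump becomes a line break.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative heuristic only; not taken from the pdfBox source. */
class SpacingSketch {

    /** A single character with its left edge, baseline and advance width. */
    static final class Glyph {
        final double x;
        final double y;
        final double width;
        final String text;

        Glyph(double x, double y, double width, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.text = text;
        }
    }

    /**
     * Rebuilds text from positioned glyphs given in reading order.
     * spaceWidth and lineTolerance are assumed tuning values.
     */
    static String rebuild(List<Glyph> glyphs, double spaceWidth, double lineTolerance) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                if (Math.abs(g.y - previous.y) > lineTolerance) {
                    out.append('\n'); // baseline moved: start a new line
                } else if (gap > spaceWidth * 0.5) {
                    out.append(' ');  // gap wider than half a space: word break
                }
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph(0, 0, 5, "H"), new Glyph(5, 0, 2, "i"),
                new Glyph(10, 0, 5, "y"), new Glyph(15, 0, 5, "o"),
                new Glyph(0, 12, 2, "!"));
        System.out.println(rebuild(glyphs, 4, 2));
    }
}
```

A heuristic of this kind has no notion of columns: the gutter between two columns just looks like a wide gap, which is consistent with not finding any column-break logic next to the space and line-break logic.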

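So, my questions again:

How does PDFTextStripper calculate the column breaks?
Is there a way in the pdfBox API to capture this / to extract text by column?
Is this possible with another PDF API?

Thanks in advance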
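One direction that may be worth exploring for the second question: pdfBox ships a PDFTextStripperByArea class that extracts text from named rectangular regions (addRegion, extractRegions, getTextForRegion). The sketch below only computes candidate rectangles and does not call pdfBox. `ColumnRegions` and `split` are hypothetical names, and it assumes a known number of equal-width columns spanning the full page, which real layouts with margins and gutters will not satisfy exactly.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a page into equal-width column rectangles. */
class ColumnRegions {

    /**
     * Returns one rectangle per column, left to right.
     * Assumes equal-width columns covering the whole page (no margins or gutters).
     */
    static List<Rectangle2D> split(double pageWidth, double pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double columnWidth = pageWidth / columns;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            regions.add(new Rectangle2D.Double(i * columnWidth, 0, columnWidth, pageHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Example page size in points; the numbers are arbitrary.
        for (Rectangle2D region : split(600, 800, 2)) {
            System.out.println(region);
        }
    }
}
```

Each rectangle could then be registered as a region with PDFTextStripperByArea and read back per column. For scanned journals with varying margins, the boundaries would need to be measured from the text positions, not derived from the page width.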
