Extract PDF text by columns

Question

My question is:

How can I extract text from a PDF file that is laid out in columns so that the result is separated by those columns?

Background: I am working on a project on text analysis (especially of scientific texts). These texts are sometimes published in multi-column layouts, with each column given its own page number. To order the extracted text by the page numbers of the layout, it would be useful to extract the text by column.

I use PDFBox and have tried / searched for several things:

  • I tried the getThreadBeads() method of the PDPage class -> result: a list of size 0
  • I tried grabbing the text with the getCharactersByArticle() method -> text not divided into columns
    (I tried this with PDF files of published texts as well as with self-created .doc-based files, each with a multi-column layout)

The thing is that PDFBox seems to divide the text by columns automatically: if I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are put in one line, without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

So I had a look at the PDFBox source code: the crucial method is the writePage() method of PDFTextStripper. This is evidently where spaces (which most PDFs do not contain) and line breaks are calculated. But I could not find how the stripper calculates the column breaks.

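So, once again, my questions:

  • How is PDFTextStripper calculating column breaks?
  • Are there methods in the PDFBox API to catch this / to extract the text by columns?
  • Is this possible with other PDF APIs?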

Thanks in advance.

Recommended answer

If I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are put in one line, without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

[...] How is PDFTextStripper calculating column breaks?

It isn't.

By setting SortByPosition to false, you tell PDFBox not to try to sort the text pieces from the page content stream, but instead to accept them in the order in which they appear.

In your document, the text pieces seem to be drawn in reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.

Activating that option (setting SortByPosition to true) for your document returns the text without regard to the columns.
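The difference between the two orderings can be illustrated with a small sketch. This is a simplified model and not PDFBox's actual sorting code: the `Piece` class and both method names are hypothetical, and real text pieces on one line rarely share exactly the same y coordinate, so a real comparator needs some tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {

    /** Hypothetical stand-in for a text piece: its text and where it is drawn. */
    static final class Piece {
        final String text;
        final double x; // distance from the left page edge
        final double y; // distance from the top page edge

        Piece(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Models SortByPosition = false: keep the order of the content stream. */
    static List<String> inStreamOrder(List<Piece> pieces) {
        List<String> out = new ArrayList<>();
        for (Piece p : pieces) {
            out.add(p.text);
        }
        return out;
    }

    /** Models SortByPosition = true: top-to-bottom first, then left-to-right. */
    static List<String> inPositionOrder(List<Piece> pieces) {
        List<Piece> sorted = new ArrayList<>(pieces);
        sorted.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return inStreamOrder(sorted);
    }
}
```

For a two-column page drawn column by column, with pieces L1 and L2 in the left column and R1 and R2 beside them in the right column, the stream order stays L1, L2, R1, R2, while the position order becomes L1, R1, L2, R2. The columns are interleaved line by line, which matches the behaviour described in the question.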

Are there methods in the PDFBox API to catch this / to extract the text by columns?

PDFBox does not analyze the page content to recognize columns. If you do that analysis yourself, though, it allows you to extract text column by column if you provide the column rectangles as regions.
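One possible form of that analysis is sketched below. It is an assumption, not something PDFBox provides: the `Region` class, `detectColumns`, and the `minGap` threshold are hypothetical names, and the approach (splitting wherever no text covers a sufficiently wide horizontal gap) fails on pages where a heading or figure spans all columns. The input is the horizontal extent of each text piece, which you would have to collect yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ColumnRegionSketch {

    /** Hypothetical column rectangle in page units, origin at the top-left corner. */
    static final class Region {
        final double x, y, width, height;

        Region(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Splits the page into columns wherever a horizontal gap of at least
     * minGap is not covered by any text piece.
     *
     * @param spans one {left, right} pair per text piece
     */
    static List<Region> detectColumns(double[][] spans, double pageHeight, double minGap) {
        List<Region> columns = new ArrayList<>();
        if (spans.length == 0) {
            return columns;
        }
        // Sort a copy by left edge so the caller's array keeps its order.
        double[][] sorted = spans.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((double[] s) -> s[0]));

        double left = sorted[0][0];
        double right = sorted[0][1];
        for (int i = 1; i < sorted.length; i++) {
            double[] s = sorted[i];
            if (s[0] - right >= minGap) {
                // Wide empty gap: close the current column and start a new one.
                columns.add(new Region(left, 0, right - left, pageHeight));
                left = s[0];
                right = s[1];
            } else {
                // Overlapping or close: extend the current column.
                right = Math.max(right, s[1]);
            }
        }
        columns.add(new Region(left, 0, right - left, pageHeight));
        return columns;
    }
}
```

Each resulting rectangle could then be handed to PDFBox's `PDFTextStripperByArea`, for example via `addRegion("col0", new Rectangle2D.Double(r.x, r.y, r.width, r.height))`, followed by `extractRegions(page)` and `getTextForRegion("col0")`. Check the coordinate convention of your PDFBox version before relying on the rectangles.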
