Java: Apache PDFBox: Extract highlighted text


Question

I am using the Apache PDFBox library to extract the highlighted text (i.e., text with a yellow background) from a PDF file. I am totally new to this library and don't know which of its classes to use for this purpose. So far I have extracted the text of comments using the code below.

PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List<?> allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    // Skip pages without annotations
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0);
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}
pddDocument.close();

Now I need to get the highlighted text; any code example would be highly appreciated.

Recommended answer

The code in the question "Not able to read the exact text highlighted across the lines" already illustrates most of the concepts needed to extract text from limited content regions on a page with PDFBox.

Having studied this code, the OP still wondered in a comment:


But one thing I am confused about is QuadPoints instead of Rect, as you mentioned there in a comment. What are these? Can you explain it with some lines of code or in simple words, as I am also facing the same problem of multi-line highlights?

In general, the area an annotation refers to is a rectangle:


Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 - Entries common to all annotation dictionaries - in ISO 32000-1)

For some annotation types (e.g. text markups), this location value does not suffice because:


  • the text to mark up may be written at some odd angle, but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
  • the text to mark up may start anywhere in one line and end anywhere in another, so the markup area is not rectangular at all but the union of multiple rectangular parts.

To cope with such annotation types, the PDF specification therefore provides a more generic way to define areas:


QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompass a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x1 y1 x2 y2 x3 y3 x4 y4

specifying the quadrilateral's four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).

(from Table 179 - Additional entries specific to text markup annotations - in ISO 32000-1)
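To make the 8 × n layout concrete, here is a minimal sketch in plain Java that splits such an array into quadrilaterals and computes one axis-parallel bounding box per quadrilateral. The class name QuadBounds, the method boundingBoxes, and the sample coordinates are hypothetical and not part of PDFBox; the float array is assumed to have been read from the annotation's QuadPoints entry beforehand. Because the box is built from the minimum and maximum of all four vertices, it does not depend on the order in which the vertices are listed.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class QuadBounds {

    /**
     * Returns one axis-parallel bounding box per quadrilateral.
     * Input and output are in default user space (origin at the lower left).
     */
    public static List<Rectangle2D.Float> boundingBoxes(float[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("QuadPoints must hold 8 numbers per quadrilateral");
        }
        List<Rectangle2D.Float> boxes = new ArrayList<>();
        for (int q = 0; q < quadPoints.length; q += 8) {
            float minX = Float.POSITIVE_INFINITY;
            float minY = Float.POSITIVE_INFINITY;
            float maxX = Float.NEGATIVE_INFINITY;
            float maxY = Float.NEGATIVE_INFINITY;
            // Each quadrilateral is four (x, y) pairs; vertex order is irrelevant here.
            for (int v = 0; v < 4; v++) {
                float x = quadPoints[q + 2 * v];
                float y = quadPoints[q + 2 * v + 1];
                minX = Math.min(minX, x);
                maxX = Math.max(maxX, x);
                minY = Math.min(minY, y);
                maxY = Math.max(maxY, y);
            }
            boxes.add(new Rectangle2D.Float(minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Made-up highlight spanning the end of one line and the start of the next.
        float[] quadPoints = {
            300, 700, 500, 700, 500, 712, 300, 712, // counterclockwise, as in the specification
            72, 698, 200, 698, 72, 686, 200, 686    // a different vertex order
        };
        for (Rectangle2D.Float box : boundingBoxes(quadPoints)) {
            System.out.println(box);
        }
    }
}
```

For text printed at an odd angle the bounding box is larger than the quadrilateral, so it may also cover neighbouring text; for horizontal or vertical text the two coincide.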

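Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code in the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));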

and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately, PDFTextStripperByArea.addRegion expects a rectangle as its parameter, not a generic quadrilateral. As text is usually printed horizontally or vertically, in which case each quadrilateral is itself an axis-parallel rectangle, that should not pose too big a problem.
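As a rough sketch of what "accordingly" can mean, the plain-Java helper below turns the axis-parallel bounding box of one quadrilateral, given in default user space, into a rectangle measured from the top of the page. It assumes, as the code in the referenced question suggests, that the stripper measures y from the top edge of the page, and it further assumes an unrotated page whose lower-left corner lies at the origin. The names StripperRegions and toStripperRegion, the sample box, and the page height of 792 are hypothetical; the PDFBox calls appear only in a comment and are not exercised here.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class StripperRegions {

    /**
     * Flips a user-space box (origin at the lower left) into coordinates whose
     * origin is the upper left corner of the page.
     * Assumes an unrotated page with its lower-left corner at (0, 0).
     */
    public static Rectangle2D.Float toStripperRegion(Rectangle2D.Float userSpaceBox, float pageHeight) {
        // The top edge of the box in user space is y + height; measure it from the page top.
        float top = pageHeight - (userSpaceBox.y + userSpaceBox.height);
        return new Rectangle2D.Float(userSpaceBox.x, top, userSpaceBox.width, userSpaceBox.height);
    }

    public static void main(String[] args) {
        // Made-up bounding box of one highlighted line on a page 792 units high.
        Rectangle2D.Float box = new Rectangle2D.Float(72, 686, 128, 12);
        Rectangle2D.Float region = toStripperRegion(box, 792);
        System.out.println(region);

        // With PDFBox on the classpath, the region would then be used roughly like this:
        //   PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        //   stripper.addRegion("quad0", region);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("quad0");
    }
}
```

One region per quadrilateral, with the extracted strings concatenated in array order, would then give the text of a multi-line highlight. Enlarging each region slightly may help when glyphs touch the edge of the highlight.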

PS: One warning concerning the specification of QuadPoints: the vertex order may differ in real-life PDFs, cf. the question "PDF Spec vs Acrobat creation (QuadPoints)".
