Java: Apache PDFBox Extract highlighted text

Views: 189 | Published: 2018/12/6 14:52:24 | Tags: java, pdf, pdfbox

Question

I am using the Apache PDFBox library to extract the highlighted text (i.e., text with a yellow background) from a PDF file. I am totally new to this library and don't know which of its classes to use for this purpose. So far I have extracted the text of comments using the code below.

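import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

// PDFBox 1.8.x API (getAllPages() no longer exists in 2.x)
PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    // Skip pages without annotations
    if (la.isEmpty()) {
        continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("\nProcess Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0);
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());
}
pddDocument.close();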

Now I need to get the highlighted text; any code example will be highly appreciated.

Recommended answer

The code in the question "Not able to read the exact text highlighted across the lines" already illustrates most of the concepts needed to extract text from limited content regions on a page with PDFBox.

Having studied this code, the OP still wondered in a comment:

But one thing I am confused about is QuadPoints instead of Rect, as you mentioned there in a comment. What are these? Can you explain it with some code lines or in simple words, as I am also facing the same problem of multi-line highlights?

In general the area an annotation refers to is a rectangle:

Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 - Entries common to all annotation dictionaries - in ISO 32000-1)

For some annotation types (e.g. text markups), this location value does not suffice because:

the text to mark up may be written at some odd angle, but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
the text to mark up may start anywhere in one line and end anywhere in another, so the markup area is not rectangular at all but the union of multiple rectangular parts.

To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:

QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompass a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x₁ y₁ x₂ y₂ x₃ y₃ x₄ y₄

specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x₁, y₁) and (x₂, y₂).

(from Table 179 - Additional entries specific to text markup annotations - in ISO 32000-1)
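
To make the 8 × n layout concrete, here is a minimal sketch (not part of the original answer) that splits a flat QuadPoints array into one axis-parallel bounding box per quadrilateral. The class name `QuadPointBoxes`, the method name `boundingBoxes`, and the sample coordinates are made up for illustration; the input is assumed to be a plain `float[]`, which in PDFBox could plausibly come from `COSArray.toFloatArray()`. Taking the minimum and maximum of the four vertices does not depend on the order in which they are listed.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a flat QuadPoints array into one bounding box per quadrilateral. */
final class QuadPointBoxes {

    /**
     * @param quadPoints 8 x n numbers (x1 y1 x2 y2 x3 y3 x4 y4 per quadrilateral) in default user space
     * @return one axis-parallel bounding rectangle per quadrilateral, still in default user space
     */
    static List<Rectangle2D.Float> boundingBoxes(float[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("QuadPoints length must be a multiple of 8");
        }
        List<Rectangle2D.Float> boxes = new ArrayList<>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            float minX = Float.POSITIVE_INFINITY, minY = Float.POSITIVE_INFINITY;
            float maxX = Float.NEGATIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
            // Walk the four (x, y) vertices of this quadrilateral.
            for (int j = i; j < i + 8; j += 2) {
                minX = Math.min(minX, quadPoints[j]);
                maxX = Math.max(maxX, quadPoints[j]);
                minY = Math.min(minY, quadPoints[j + 1]);
                maxY = Math.max(maxY, quadPoints[j + 1]);
            }
            boxes.add(new Rectangle2D.Float(minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Two made-up quadrilaterals: a highlight covering the end of one line
        // and the start of the next.
        float[] quads = {
            300, 700, 500, 700, 500, 712, 300, 712,
             72, 686, 200, 686, 200, 698,  72, 698
        };
        for (Rectangle2D.Float box : boundingBoxes(quads)) {
            System.out.println(box);
        }
    }
}
```

For horizontally or vertically printed text the bounding box coincides with the quadrilateral itself; for text at an odd angle it is only an approximation that may also cover neighbouring text.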

Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code in the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));

and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.
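
One detail to watch when defining those regions: QuadPoints are given in default user space, where y grows upwards from the bottom of the page, whereas the region rectangles used by the text stripper are, as far as I can tell, compared against text positions measured from the top of the page. The following sketch (an illustration, not code from the original answer) shows the flip for a single bounding box. The class `RegionMapper`, the method `toStripperRegion`, and the sample numbers are hypothetical, and it assumes an unrotated page whose visible area starts at (0, 0).

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: maps a user-space bounding box to a top-left-origin region rectangle. */
final class RegionMapper {

    /**
     * Assumption: the page is not rotated and its visible area starts at (0, 0).
     *
     * @param userSpaceBox bounding box of one quadrilateral in default user space (y grows upwards)
     * @param pageHeight   page height in user space units
     * @return the same area with y measured downwards from the top of the page
     */
    static Rectangle2D.Float toStripperRegion(Rectangle2D.Float userSpaceBox, float pageHeight) {
        // In user space the top edge of the box is y + height.
        float top = userSpaceBox.y + userSpaceBox.height;
        return new Rectangle2D.Float(userSpaceBox.x, pageHeight - top,
                userSpaceBox.width, userSpaceBox.height);
    }

    public static void main(String[] args) {
        // Made-up box on a US Letter page (792 units high).
        Rectangle2D.Float box = new Rectangle2D.Float(72, 686, 128, 12);
        Rectangle2D.Float region = toStripperRegion(box, 792);
        System.out.println(region);
        // Intended use with PDFBox (not executed here):
        //   stripper.addRegion("quad0", region);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("quad0");
    }
}
```

If the extracted text comes back empty or clipped, it may help to check whether the flip is needed for the PDFBox version in use and to enlarge the region slightly, since highlight quadrilaterals often hug the glyphs tightly.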

PS: One warning concerning the specification of the QuadPoints: the order may differ in real-life PDFs; cf. the question "PDF Spec vs Acrobat creation (QuadPoints)".
