Not able to read the exact text highlighted across the lines


Problem description

I am reading highlighted text from a PDF document using PDFBox. I can read text highlighted within a single line, whether it is one word or several. However, I cannot read the exact text when a highlight spans more than one line. Here is the sample code I use to read the highlighted text.

PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Page number : " + pageNum);
    for (PDAnnotation pdfAnnot : la) {
        if (pdfAnnot.getSubtype().equals("Popup")) {
            continue;
        }

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        PDRectangle rect = pdfAnnot.getRectangle();
        float x = rect.getLowerLeftX() - 1;
        float y = rect.getUpperRightY() - 1;
        float width = rect.getWidth();
        float height = rect.getHeight() + rect.getHeight() / 4;

        int rotation = page.findRotation();
        if (rotation == 0) {
            PDRectangle pageSize = page.getMediaBox();
            y = pageSize.getHeight() - y;
        }

        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion(Integer.toString(0), awtRect);
        stripper.extractRegions(page);
        System.out.println("------------------------------------------------------------------");
        System.out.println("Annot type = " + pdfAnnot.getSubtype());
        System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n");
        System.out.println("Getting text from comment = " + pdfAnnot.getContents());
    }
}

When a highlight spans several lines, pdfAnnot.getRectangle() returns the smallest rectangle that encloses the whole highlight. Extracting by that rectangle yields more text than was highlighted. I could not find any API that extracts exactly the highlighted text.

For example, this is the text of the test PDF file:


Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.

Use case 1: Reading the first bolded text, i.e. "PDF". There is no issue reading text highlighted within a single line. The correct text is printed, as shown below:
Output: Getting text from region = "PDF"

Use case 2: Reading the second bolded text, i.e. "Adobe Acrobat Reader", which spans two lines. In this case, the text extracted by the program above is:
Output: Getting text from region = "Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat Reader. Recipients of other file formats sometimes can't open files because they"

The getRectangle() API gives the coordinates of the smallest rectangle that encloses the highlighted text. For a highlight that wraps onto a second line, that rectangle covers both lines in full, so the extracted text contains more than "Adobe Acrobat Reader".
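
To illustrate why the enclosing rectangle over-selects, here is a minimal sketch that uses only java.awt.geom. The coordinates and the class name BoundingBoxDemo are made up for illustration; they are not taken from the test file.

```java
import java.awt.geom.Rectangle2D;

public class BoundingBoxDemo {

    /** Smallest rectangle that encloses both inputs, analogous to an annotation's bounding rectangle. */
    static Rectangle2D.Float enclose(Rectangle2D.Float a, Rectangle2D.Float b) {
        Rectangle2D.Float out = new Rectangle2D.Float();
        Rectangle2D.union(a, b, out);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical layout: text lines run from x = 72 to x = 500.
        // The highlight covers the end of line 1 and the start of line 2.
        Rectangle2D.Float endOfLine1 = new Rectangle2D.Float(400f, 700f, 100f, 12f);
        Rectangle2D.Float startOfLine2 = new Rectangle2D.Float(72f, 686f, 38f, 12f);

        Rectangle2D.Float box = enclose(endOfLine1, startOfLine2);

        // The enclosing box spans the full text width of both lines,
        // so extracting by it returns both lines, not just the highlight.
        System.out.println("x=" + box.x + " width=" + box.width + " height=" + box.height);
    }
}
```

With these assumed numbers the enclosing box starts at the left margin and ends at the right margin, which matches the behaviour described above: two complete lines are returned.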


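  1. How can I find the start and end points of the highlight within the extracted text?

  2. How can I find the number of lines in the extracted region?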

Any help will be highly appreciated.

Recommended answer

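I managed to extract the highlighted text by using the following code. It reads the annotation's QuadPoints entry, which holds one quadrilateral per highlighted line segment, and extracts the text of each quadrilateral separately.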

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
// Uses the PDFBox 1.8.x API (getAllPages(), getDictionary()).
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // In-memory representation of the PDF document, loaded from a file.
    PDDocument document = PDDocument.load(filePath);
    // All pages in the PDF document.
    List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
    // A single page; pageNumber is a zero-based index into the page list.
    PDPage page = allPages.get(pageNumber);
    // Annotation dictionaries of that page.
    List<PDAnnotation> annotations = page.getAnnotations();

    for (int i = 0; i < annotations.size(); i++) {
        // Only highlight annotations are processed.
        if (annotations.get(i).getSubtype().equals("Highlight")) {
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            // QuadPoints holds 8 numbers (4 corner points) per highlighted line segment.
            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for (int j = 1, k = 0; j <= (quadsArray.size() / 8); j++) {

                // Cast to COSNumber so that both COSFloat and COSInteger entries are accepted.
                COSNumber ULX = (COSNumber) quadsArray.get(0 + k);
                COSNumber ULY = (COSNumber) quadsArray.get(1 + k);
                COSNumber URX = (COSNumber) quadsArray.get(2 + k);
                COSNumber URY = (COSNumber) quadsArray.get(3 + k);
                COSNumber LLX = (COSNumber) quadsArray.get(4 + k);
                COSNumber LLY = (COSNumber) quadsArray.get(5 + k);
                COSNumber LRX = (COSNumber) quadsArray.get(6 + k);
                COSNumber LRY = (COSNumber) quadsArray.get(7 + k);

                k += 8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // upperRightY - lowerLeftY.

                // Convert from PDF coordinates (origin bottom-left) to the
                // top-left origin expected by the region rectangle.
                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                // Concatenate the text of all line segments of one highlight.
                if (j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}
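
The coordinate handling can also be tried out without PDFBox. The following is a minimal sketch under stated assumptions: the quad points are already available as a plain float array, the page is not rotated, and the media box starts at (0, 0). The class QuadPointRegions and the method toRegions are hypothetical names, not part of PDFBox. It takes the minimum and maximum of the four corners instead of relying on a particular corner order, and the number of rectangles it returns equals the number of highlighted line segments.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class QuadPointRegions {

    /**
     * Turns a flat QuadPoints array (8 numbers per quadrilateral) into one
     * rectangle per quadrilateral, expressed with a top-left origin.
     */
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Scan the four (x, y) corners; the result does not depend on their order.
            for (int p = 0; p < 8; p += 2) {
                minX = Math.min(minX, quadPoints[i + p]);
                maxX = Math.max(maxX, quadPoints[i + p]);
                minY = Math.min(minY, quadPoints[i + p + 1]);
                maxY = Math.max(maxY, quadPoints[i + p + 1]);
            }
            // Flip the y axis: PDF user space has its origin at the bottom-left,
            // while the extraction region is measured from the top-left.
            regions.add(new Rectangle2D.Float(minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Hypothetical highlight spanning two lines on a page 792 units high.
        float[] quadPoints = {
            400f, 712f, 500f, 712f, 400f, 700f, 500f, 700f,  // end of line 1
             72f, 698f, 110f, 698f,  72f, 686f, 110f, 686f   // start of line 2
        };
        List<Rectangle2D.Float> regions = toRegions(quadPoints, 792f);
        System.out.println("Line segments: " + regions.size());
        for (Rectangle2D.Float r : regions) {
            System.out.println(r);
        }
    }
}
```

Each rectangle could then be registered as its own region on a PDFTextStripperByArea, for example under the names "line0", "line1" and so on, instead of reusing a single region name.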
