Extracting text in a specific region of a PDF page using ICEpdf


Question

Is there any way to extract the text of a specific region using ICEpdf? I was able to extract whole pages, but that's not what I want to do.

(I know PDFBox nicely extracts the text in a specific rectangular area of a page. However, since the image rendering works a lot better in ICEpdf, I'd like to use that library.)

Recommended answer

On the Page object that represents a page you can call the method:

PageText pageText = document.getPageText(pageNumber);

This is similar to the bundled example ./examples/extraction/PageTextExtraction.java

The PageText object contains all the LineText -> WordText -> GlyphText objects for the page. LineText, WordText and GlyphText all extend AbstractText, which has a getBounds() method. The bounds of these objects are in PDF user space, the 1st geometric quadrant (origin at the bottom-left, y increasing upward). Java2D is in the 4th geometric quadrant (origin at the top-left, y increasing downward), so a rectangle taken from the view has to be converted before it is compared with text bounds. Assuming you already have the selectionRectangle, the code would be as follows:


// clear the currently selected state, ignore highlighted.
currentPage.getViewText().clearSelected();

// get page transform, same for all calculations
AffineTransform pageTransform = currentPage.getPageTransform(
        Page.BOUNDARY_CROPBOX,
        documentViewModel.getViewRotation(),
        documentViewModel.getViewZoom());

// convertRectangleToPageSpace returns a Rectangle2D (or null on failure)
Rectangle2D pageSpaceSelectRectangle =
        convertRectangleToPageSpace(selectionRectangle, pageTransform);
ArrayList<LineText> pageLines = pageText.getPageLines();
for (LineText pageLine : pageLines) {
    // check for intersection, if so break into words.
    if (pageLine.getBounds().intersects(pageSpaceSelectRectangle)) {
        // you have some selected text.
    }
}
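
The "you have some selected text" branch above can be refined by descending from lines to words and keeping only the words whose bounds intersect the region. The following is a minimal sketch of that filtering step, not ICEpdf code: the `Word` class and the `textInRegion` method are hypothetical stand-ins so the example runs without the library. With ICEpdf you would iterate the words of each LineText and call getBounds() on each WordText instead; the exact accessor names depend on your ICEpdf version and should be checked there.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionTextSketch {

    // Hypothetical stand-in for a word whose bounds are in PDF user space.
    static final class Word {
        final String text;
        final Rectangle2D bounds;

        Word(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    // Collects the words that intersect the region. Words from the same
    // source line are joined with a space; lines are joined with '\n'.
    static String textInRegion(List<List<Word>> lines, Rectangle2D region) {
        StringBuilder out = new StringBuilder();
        for (List<Word> line : lines) {
            StringBuilder lineText = new StringBuilder();
            for (Word word : line) {
                if (word.bounds.intersects(region)) {
                    if (lineText.length() > 0) {
                        lineText.append(' ');
                    }
                    lineText.append(word.text);
                }
            }
            if (lineText.length() > 0) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(lineText);
            }
        }
        return out.toString();
    }

    // Builds a tiny made-up page: two words near the top, one near the bottom.
    static List<List<Word>> samplePage() {
        List<Word> first = new ArrayList<>();
        first.add(new Word("Invoice", new Rectangle2D.Double(50, 700, 60, 12)));
        first.add(new Word("Total", new Rectangle2D.Double(300, 700, 40, 12)));
        List<Word> second = new ArrayList<>();
        second.add(new Word("Footer", new Rectangle2D.Double(50, 40, 50, 12)));
        List<List<Word>> lines = new ArrayList<>();
        lines.add(first);
        lines.add(second);
        return lines;
    }

    public static void main(String[] args) {
        Rectangle2D region = new Rectangle2D.Double(40, 690, 100, 30);
        System.out.println(textInRegion(samplePage(), region));
    }
}
```

Note that `intersects` also accepts words that are only partly inside the region; use `region.contains(word.bounds)` instead if only fully enclosed words should be kept.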



    /**
     * Converts the rectangle to the space specified by the page transform. This
     * is a utility method for converting a selection rectangle to page space
     * so that an intersection can be calculated to determine a selected state.
     *
     * @param mouseRect     rectangle to convert space of
     * @param pageTransform page transform
     * @return converted rectangle.
     */
    private Rectangle2D convertRectangleToPageSpace(Rectangle mouseRect,
                                                    AffineTransform pageTransform) {
        GeneralPath shapePath;
        try {
            // invert the page transform to go from view space back to page space
            AffineTransform transform = pageTransform.createInverse();
            shapePath = new GeneralPath(mouseRect);
            shapePath.transform(transform);
            return shapePath.getBounds2D();
        } catch (NoninvertibleTransformException e) {
            logger.log(Level.SEVERE,
                    "Error converting mouse point to page space.", e);
        }
        return null;
    }
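
To make the effect of the inverse transform concrete, here is a small worked sketch that uses only java.awt.geom and no ICEpdf classes. The `pageToView` transform is an assumption built by hand (a 792 point tall page, zoom 2, no rotation, no crop offset); it is not what Page.getPageTransform returns, but it has the same general shape of flipping the y axis and scaling. The class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

public class PageSpaceDemo {

    // Hypothetical stand-in for a page transform: maps PDF user space
    // (origin bottom-left, y up) to view space (origin top-left, y down).
    // A point (x, y) becomes (zoom * x, zoom * (pageHeight - y)).
    static AffineTransform pageToView(double pageHeight, double zoom) {
        AffineTransform at = new AffineTransform();
        at.scale(zoom, -zoom);
        at.translate(0, -pageHeight);
        return at;
    }

    // Same idea as convertRectangleToPageSpace: push the view rectangle
    // through the inverse page transform and take the bounding box.
    static Rectangle2D toPageSpace(Rectangle2D viewRect,
                                   AffineTransform pageTransform) {
        try {
            Path2D path = new Path2D.Double(viewRect);
            path.transform(pageTransform.createInverse());
            return path.getBounds2D();
        } catch (NoninvertibleTransformException e) {
            throw new IllegalArgumentException("page transform is not invertible", e);
        }
    }

    public static void main(String[] args) {
        AffineTransform pageTransform = pageToView(792, 2);
        // A selection dragged in the view: 300x50 pixels at (100, 200).
        Rectangle2D viewRect = new Rectangle2D.Double(100, 200, 300, 50);
        Rectangle2D pageRect = toPageSpace(viewRect, pageTransform);
        // In page space this is x=50, y=667, width=150, height=25.
        System.out.println(pageRect.getX() + " " + pageRect.getY() + " "
                + pageRect.getWidth() + " " + pageRect.getHeight());
    }
}
```

The view rectangle's top edge (y = 200 pixels) ends up as the upper edge of the page-space rectangle (y = 692 points), which is why text bounds and the selection can only be compared after this conversion.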
