PDFBox 2.0: invisible lines on rotated page - clip path issue


Question

Example file: click here

Using the great solution from this topic, I am trying to extract visible text. The attached document has very small text, which may be causing this clip path problem where some parts of letters could be hidden. For such rotated text I changed the code from the linked issue a bit:

    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();

        // Keep the text only if its center point lies inside the current clip path.
        Vector center = getTextPositionCenterPoint(text);
        Area area = gs.getCurrentClippingPath();
        if (area == null || area.contains(lowerLeftX + center.getX(), lowerLeftY + center.getY())) {
            nonStrokingColors.put(text, gs.getNonStrokingColor());
            renderingModes.put(text, gs.getTextState().getRenderingMode());
            super.processTextPosition(text);
        }
    }

    private Vector getTextPositionCenterPoint(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector center = null;
        // Move half the glyph width along the baseline, depending on the rotation.
        switch (rotation) {
        case 0:
            center = new Vector(start.getX() + text.getWidth()/2, start.getY());
            break;
        case 90:
            center = new Vector(start.getX(), start.getY() + text.getWidth()/2);
            break;
        case 180:
            center = new Vector(start.getX() - text.getWidth()/2, start.getY());
            break;
        case 270:
            center = new Vector(start.getX(), start.getY() - text.getWidth()/2);
            break;
        default:
            center = new Vector(start.getX() + text.getWidth()/2, start.getY());
            break;
        }

        return center;
    }

What I'm trying to do is get the X-center point of each character depending on the rotation. (I'm aware that this sometimes does not work because of text direction, but that does not appear to be the case here.) After applying this solution, however, the 2nd, 3rd, and some other rows at the bottom are skipped because of the clip path. I'm wondering where my mistake is. Thanks in advance!

Recommended answer

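The problems with your PDF are caused by a combination of

  • text coordinates lying exactly on the clip path border;
  • different calculation paths for text coordinates and clip path coordinates having different floating point errors, so that text coordinates on the clip path border are sometimes calculated to be outside the clip path.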
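The second cause can be illustrated without PDFBox at all. The following is a minimal sketch, assuming a hypothetical clip rectangle whose lower border sits at y = 0.7 and a baseline value that passes through single precision on its way to the comparison. The class name, the helper names, and the numbers are illustrative only; the calculations PDFBox actually performs are more involved.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

class BorderRoundingDemo {
    // Hypothetical clip area; its lower border y = 0.7 is kept in double precision.
    static Area clipArea() {
        return new Area(new Rectangle2D.Double(0.0, 0.7, 100.0, 20.0));
    }

    // The same nominal value pushed through single precision, as a float-based
    // calculation path would do.
    static double throughFloat(double value) {
        return (float) value;
    }

    public static void main(String[] args) {
        double border = 0.7;
        double baseline = throughFloat(border);

        // The float representation of 0.7 is slightly smaller than the double one.
        System.out.println(baseline < border);

        // A point on the nominal border therefore tests as outside the clip area.
        System.out.println(clipArea().contains(10.0, baseline));

        // A point clearly above the border tests as inside.
        System.out.println(clipArea().contains(10.0, border + 1e-6));
    }
}
```

The deviation is tiny, but a strict contains test has no tolerance, so it is enough to drop a glyph that visually sits on the border.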

Unfortunately, your attempt to change this does not help here: the problem texts have their baseline coinciding with the clip path border, and your getTextPositionCenterPoint only centers along the baseline, so the centered point has issues exactly when the glyph origin has issues.
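To make this concrete, here is a small sketch that mirrors the switch from the question using plain floats instead of PDFBox types. The class name, the center helper, and the sample coordinates are assumptions made for illustration; only the arithmetic is taken from the question.

```java
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

class BaselineCenterDemo {
    // Mirrors the question's switch: move half the glyph width along the baseline.
    static Point2D.Float center(int rotation, float startX, float startY, float width) {
        switch (rotation) {
        case 90:
            return new Point2D.Float(startX, startY + width / 2);
        case 180:
            return new Point2D.Float(startX - width / 2, startY);
        case 270:
            return new Point2D.Float(startX, startY - width / 2);
        default:
            return new Point2D.Float(startX + width / 2, startY);
        }
    }

    public static void main(String[] args) {
        // Hypothetical clip area whose lower border is y = 0.7 in double precision.
        Area clip = new Area(new Rectangle2D.Double(0.0, 0.7, 100.0, 20.0));

        // Hypothetical glyph origin whose baseline went through float: 0.7f < 0.7.
        float startX = 10f, startY = 0.7f, width = 4f;
        Point2D.Float c = center(0, startX, startY, width);

        // For rotation 0 the shift changes x only; y stays on the baseline.
        System.out.println(c.y == startY);

        // Origin and centered point are rejected alike.
        System.out.println(clip.contains(startX, startY));
        System.out.println(clip.contains(c.x, c.y));
    }
}
```

In every branch the coordinate perpendicular to the baseline is left unchanged, which is why the centered point inherits the rounding problem of the origin.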

A different workaround works better: using a fat point comparison. Instead of checking whether a given point x, y is in the clip area, we check whether a small rectangle around those coordinates intersects the clip area. If coordinates wander out of the clip area due to floating point errors, this suffices to find them in the clip area nonetheless.

To do this, we replace the area.contains(x, y) check in processTextPosition with contains(area, x, y), which is implemented as

protected boolean contains(Area area, float x, float y) {
    double length = .0002;
    double up = 1.0001;
    double down = .9999;
    return area.intersects(x < 0 ? x*up : x*down, y < 0 ? y*up : y*down, Math.abs(x*length), Math.abs(y*length));
}

(PDFVisibleTextStripper helper method)

(The choice of the rectangle around the coordinates is somewhat arbitrary; this choice simply worked for me.)
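The difference between the strict test and the fat point test can be tried in isolation. The sketch below copies the helper's arithmetic into a standalone class; the class name, the demoClip helper, and the sample coordinates are assumptions chosen so that the point lies a rounding error below the clip border.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

class FatPointDemo {
    // Same arithmetic as the helper above: a small rectangle around (x, y),
    // sized relative to the coordinates, must intersect the area.
    static boolean fatContains(Area area, float x, float y) {
        double length = .0002;
        double up = 1.0001;
        double down = .9999;
        return area.intersects(x < 0 ? x * up : x * down, y < 0 ? y * up : y * down,
                Math.abs(x * length), Math.abs(y * length));
    }

    // Hypothetical clip area with its lower border at y = 0.7 (double precision).
    static Area demoClip() {
        return new Area(new Rectangle2D.Double(100.0, 0.7, 400.0, 50.0));
    }

    public static void main(String[] args) {
        Area clip = demoClip();
        float x = 200f;
        float y = 0.7f; // as a float, slightly below the double value 0.7

        System.out.println(clip.contains(x, y));          // strict test rejects the point
        System.out.println(fatContains(clip, x, y));      // fat point test accepts it
        System.out.println(fatContains(clip, x, 0.5f));   // a point well below is still rejected
    }
}
```

Because the rectangle is sized relative to the coordinates, its tolerance grows with the distance from the origin, and a coordinate of exactly 0 yields an empty rectangle, which java.awt.geom.Area.intersects appears to treat as not intersecting. A fixed absolute tolerance would be an alternative if that matters for a given document.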

With this change I get your missing 2nd, 3rd, and some other rows at the bottom, cf. the test ExtractVisibleText.testFat1.
