Remove invisible text from PDF using PDFBox

Question

Link to pdf

When I try to extract the text from the pdf above, I get a mixture of text that was invisible in the evince viewer as well as text that was visible. In addition, some of the desired text is missing characters that were not missing in the viewer, such as, the 'S' in 'FALCONS' and the many missing '½' characters. I believe this is due to interference from the invisible text because when highlighting the pdf in the viewer, the invisible text can be seen overlapping visible text.

Is there a way to remove the invisible text? Or is there another solution?

Code:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;


public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text =  textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }

        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;

        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        System.out.println(text);
    }
}

Output (bold text is the desired text):


145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as shown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as hown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEEL RS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
14
160
41
15715 156154150 153149 152148 51147
142
158
50
146
S lections
Number of Teams
Amount Bet

ark box as sho n 
 denotes home team
PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:0 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:0 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:0 8 CARDINALS17– ½ 3+ ½
9 BU CANEERS  PM1:0 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:0 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:0 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:0 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:0 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO F OTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,0
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
  FALCON      - 9  1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14  1:00p 18 BEARS + ½
 19 PACKERS - 12  1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9  10:20p 32 49ERS  - 4½


Answer

The invisible text in the OP's sample PDF is mostly made invisible in two ways: by defining clip paths (the text lies outside their bounds) and by filling paths (which hide the text underneath). Thus, we have to consider path-related instructions during text extraction to ignore that invisible text.
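
The two hiding mechanisms can be illustrated with plain `java.awt.geom` shapes, independent of PDFBox. This is only a minimal sketch: the class name `VisibilitySketch`, its helper method, and all coordinates are made up for illustration and do not come from the sample PDF.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: decides whether a glyph origin
// would still be visible given a clip area and a path filled after the text.
final class VisibilitySketch {

    /** A point is visible if it is inside the clip and not covered by a later fill. */
    static boolean isVisible(Area clip, GeneralPath laterFill, double x, double y) {
        boolean insideClip = clip == null || clip.contains(x, y);
        boolean covered = laterFill != null && laterFill.contains(x, y);
        return insideClip && !covered;
    }

    public static void main(String[] args) {
        // Assumed clip area: a 100 x 50 rectangle at the origin.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 50));

        // Assumed fill drawn after the text: a rectangle from (10,10) to (40,30).
        GeneralPath fill = new GeneralPath(GeneralPath.WIND_NON_ZERO);
        fill.moveTo(10f, 10f);
        fill.lineTo(40f, 10f);
        fill.lineTo(40f, 30f);
        fill.lineTo(10f, 30f);
        fill.closePath();

        System.out.println(isVisible(clip, fill, 20, 20));  // false: hidden under the fill
        System.out.println(isVisible(clip, fill, 60, 20));  // true: inside clip, not covered
        System.out.println(isVisible(clip, fill, 150, 20)); // false: outside the clip
    }
}
```

A text stripper that wants to skip invisible text needs both pieces of information, which is why the path and clip operators matter for extraction.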

Unfortunately, callbacks designed for these instructions are not declared in PDFTextStripper or its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

But they are declared in the other major PDFStreamEngine subclass PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.

To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, e.g. like this:

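// Imports for the PDFBox 2.0.x API that this class is written against
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {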
    public PDFVisibleTextStripper() throws IOException {
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    private GeneralPath linePath = new GeneralPath();

    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (!toRemove.isEmpty()) {
                list.removeAll(toRemove);
            }
        }
    }

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);

            float x1 = x.floatValue();
            float y1 = y.floatValue();

            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;

            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);

            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;

            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D currentPoint = linePath.getCurrentPoint();

            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}

PDFVisibleTextStripper

Please make sure you use the inner operator classes in the PDFVisibleTextStripper constructor, not the classes used by PageDrawer with the same name. To make sure simply follow the link under the code.

This reduces the output to

REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
 3 FALCONS     - 9½ 1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14½ 1:00p 18 BEARS + ½
 19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9½ 10:20p 32 49ERS  - 4½

which drops most of the unwanted data.

In the context of this question it became apparent that the way processTextPosition and deleteCharsInPath calculate the end of a character baseline implicitly assumes horizontal text without page rotation. If one loosens one's criteria for "Visibility", though, one can assume a character to be visible iff the start of its baseline is visible. In that case one does not need that calculated Vector end anymore and the code works ok for rotated pages, too.
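
The difference between the two visibility criteria can be sketched with plain `java.awt.geom`, without PDFBox. The class `BaselineVisibilitySketch`, its method names, and the numbers are assumptions made up for this illustration; the strict check mirrors the `start`/`end` test in the code above, and the loose check keeps only the `start` test.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration, not part of PDFBox.
final class BaselineVisibilitySketch {

    /** Strict criterion: start and the assumed end (start.x + width, start.y) must be in the clip. */
    static boolean strictVisible(Area clip, double startX, double startY, double width) {
        return clip == null
                || (clip.contains(startX, startY) && clip.contains(startX + width, startY));
    }

    /** Loosened criterion: only the start of the baseline must be in the clip. */
    static boolean looseVisible(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // Assumed clip: a narrow, tall column, as might occur on a rotated page.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 10, 200));

        // Assumed glyph: starts at (5, 50), is 12 units wide, and runs vertically,
        // so its real baseline end (5, 62) is still inside the clip.
        double startX = 5, startY = 50, width = 12;

        // The strict check adds the width to x, lands at (17, 50), and rejects the glyph.
        System.out.println(strictVisible(clip, startX, startY, width)); // false
        // The loose check only looks at the start point and accepts it.
        System.out.println(looseVisible(clip, startX, startY));         // true
    }
}
```

The loosened check can misjudge a glyph whose start is inside the clip but whose body is mostly outside it, which is the trade-off accepted for rotation independence.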
