How do I reconcile these text positions and line positions with PDFBox?


Question

I am working with a large document, but I have extracted the page that is giving trouble here. The y-coordinates I get back for the lines in the table seem to be stretched beyond the coordinates of the text. Some transformation seems to be going on, but I cannot find it. If possible, I would like to fix the problem within the scope of the PDFGraphicsStreamEngine as extended below, rather than going back to the drawing board with the other stream engines available in PDFBox.

I have extended PDFTextStripper to acquire the location of every text glyph on the page:

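import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Collects the TextPosition of every glyph instead of writing text output.
public class MyPDFTextStripper extends PDFTextStripper {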

    private List<TextPosition> tps;

    public MyPDFTextStripper() throws IOException {
        tps = new ArrayList<>();
    }

    @Override
    protected void writeString
            (String text,
             List<TextPosition> textPositions)
            throws IOException {
        tps.addAll(textPositions);
    }

    List<TextPosition> getTps() {
        return tps;
    }
}

I have also extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:

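import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.pdmodel.PDPage;

// Records the bounding-box diagonal of every stroked path as a Line2D.
public class LineCatcher extends PDFGraphicsStreamEngine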
{
    private final GeneralPath linePath = new GeneralPath();
    private List<Line2D> lines;

    LineCatcher(PDPage page)
    {
        super(page);
        lines = new ArrayList<>();
    }

    List<Line2D> getLines() {
        return lines;
    }

    @Override
    public void strokePath() throws IOException
    {
        Rectangle2D rect = linePath.getBounds2D();
        Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                rect.getX() + rect.getWidth(),
                rect.getY() + rect.getHeight());
        lines.add(line);
        linePath.reset();
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {linePath.moveTo(x, y);}
    @Override
    public void lineTo(float x, float y) throws IOException
    {linePath.lineTo(x, y);}
    @Override
    public Point2D getCurrentPoint() throws IOException
    {return linePath.getCurrentPoint();}

    // The remaining abstract methods of PDFGraphicsStreamEngine must still be
    // overridden, but they can be left empty for the purposes of this problem.
}

I have written a simple program to demonstrate the problem:

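import java.awt.geom.Line2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.TextPosition;

// Prints the distinct integer y-coordinates of the text and of the lines.
public class PageAnalysis {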
    public static void main(String[] args) {
        try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
            PDPage page = doc.getPage(0);

            MyPDFTextStripper ts = new MyPDFTextStripper();
            ts.getText(doc);
            List<TextPosition> tps = ts.getTps();

            System.out.println("Y coordinates in text:");
            Set<Integer> ySet = new HashSet<>();
            for (TextPosition tp: tps) {
                ySet.add((int)tp.getY());
            }
            List<Integer> yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();


            System.out.println("Y coordinates in lines:");
            LineCatcher lineCatcher = new LineCatcher(page);
            lineCatcher.processPage(page);
            List<Line2D> lines = lineCatcher.getLines();
            ySet = new HashSet<>();
            for (Line2D line: lines) {
                ySet.add((int)line.getY2());
            }
            yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output is:

Y coordinates in text:
66  79  106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780 
Y coordinates in lines:
322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713

The last number in the text list corresponds to the y-coordinate of the page number at the bottom of the page. I cannot work out what is happening to the y-coordinates of the lines, though they seem to be the ones that have been transformed (the media box is the same here as it was for the text, and it fits the text positions). The current transformation matrix also has a y scaling of 1.0.

Recommended answer

Indeed, PDFTextStripper has the bad habit of transforming coordinates into a very un-PDF-like coordinate system: one with the origin in the upper left of the page and y coordinates increasing downwards.

For a TextPosition tp, therefore, you should not use

tp.getY()

but instead

tp.getTextMatrix().getTranslateY()

Unfortunately, although these coordinates are closer to the actual PDF default coordinate system, they may still be translated (cf. this answer): they are shifted so that the origin lies in the lower left corner of the crop box.

Thus, you really need something like this:

tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

where cropBox is retrieved as

PDRectangle cropBox = doc.getPage(n).getCropBox();

where, in turn, n is the zero-based index of the page with that content.
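To make the arithmetic concrete, here is a minimal sketch in plain Java with no PDFBox dependency. The class and method names are hypothetical helpers, not PDFBox API, and the crop box values are made up. The second helper additionally assumes an unrotated page on which the value from tp.getY() is measured downwards from the top edge of the crop box. That assumption is not confirmed above and should be checked against your PDFBox version.

```java
/**
 * Illustrative arithmetic only. All names are hypothetical helpers,
 * not part of PDFBox.
 */
public class YCoordinateSketch {

    /**
     * Text matrix y (origin at the lower left corner of the crop box)
     * converted to the PDF default user space, as in the answer:
     * tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()
     */
    public static double textMatrixYToUserSpace(double textMatrixY,
                                                double cropBoxLowerLeftY) {
        return textMatrixY + cropBoxLowerLeftY;
    }

    /**
     * ASSUMPTION: unrotated page, stripper y measured downwards from the
     * top edge of the crop box. Flips the axis and then applies the same
     * crop box offset as above.
     */
    public static double stripperYToUserSpace(double stripperY,
                                              double cropBoxLowerLeftY,
                                              double cropBoxHeight) {
        return cropBoxLowerLeftY + cropBoxHeight - stripperY;
    }

    public static void main(String[] args) {
        // Made-up crop box: lower edge at y = 20, height 800.
        double lowerLeftY = 20.0;
        double height = 800.0;

        // A glyph 100 units above the crop box bottom ...
        System.out.println(textMatrixYToUserSpace(100.0, lowerLeftY));
        // ... is 700 units below the crop box top under the assumption above.
        System.out.println(stripperYToUserSpace(700.0, lowerLeftY, height));
    }
}
```

Both calls describe the same glyph and print the same user-space value, which is the coordinate system in which the y values collected by LineCatcher would be compared.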
