How to find table border lines in a PDF using PDFBox?

Problem description

I am trying to find the table border lines in a PDF. I used the PrintTextLocations class of PDFBox to assemble words. Now I want to find the coordinates of the lines that form the table. I tried using org.apache.pdfbox.pdfviewer.PageDrawer, but I cannot find any character or graphic containing those lines. I tried two ways:

First:

Graphics g = null;
Dimension d = new Dimension();
d.setSize(700, 700);
PageDrawer pageDrawer = new PageDrawer();
pageDrawer.drawPage(g, myPage, d);

It gave me a NullPointerException. Second, I tried to override the processStream function, but I could not get any strokes. Kindly help me out. I am open to using any other library that gives me the coordinates of the lines in the table. Another quick question: what kind of objects are those table border lines in PDFBox? Are they graphics or characters?

Here is the link to the sample PDF I am trying to parse: http://stats.bls.gov/news.release/pdf/empsit.pdf and I am trying to get the table lines on page 8.

Edit: I faced another problem. While parsing page 1 of this PDF, I cannot get any lines because the pathIterator in the printPath() function is empty, although the strokePath() function is called for each line. How can I work with this PDF?

Solution

In the 1.8.* versions, PDFBox's parsing capabilities were implemented in a not very generic way. In particular, the OperatorProcessor implementations were tightly coupled to specific parser classes; for example, the implementations dealing with path drawing operations assumed that they interact with a PageDrawer instance.

Thus, unless one wanted to copy and paste all those OperatorProcessor classes with minute changes, one had to derive from such a specific parser class.

In your case, therefore, we will also derive our parser from PageDrawer; after all, we are interested in path drawing operations:

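// Imports for PDFBox 1.8.* (package names differ in 2.x and later)
import java.awt.Image;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdfviewer.PageDrawer;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.TextPosition;

public class PrintPaths extends PageDrawer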
{
    //
    // constructor
    //
    public PrintPaths() throws IOException
    {
        super();
    }

    //
    // method overrides for mere path observation
    //
    // ignore text
    @Override
    protected void processTextPosition(TextPosition text) { }

    // ignore bitmaps
    @Override
    public void drawImage(Image awtImage, AffineTransform at) { }

    // ignore shadings
    @Override
    public void shFill(COSName shadingName) throws IOException { }

    @Override
    public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
    {
        PDRectangle cropBox = aPage.findCropBox();
        this.pageSize = cropBox.createDimension();
        super.processStream(aPage, resources, cosStream);
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        printPath();
        System.out.printf("Fill; windingrule: %s\n\n", windingRule);
        getLinePath().reset();
    }

    @Override
    public void strokePath() throws IOException
    {
        printPath();
        System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
        getLinePath().reset();
    }

    void printPath()
    {
        GeneralPath path = getLinePath();
        PathIterator pathIterator = path.getPathIterator(null);

        double x = 0, y = 0;
        double coords[] = new double[6];
        while (!pathIterator.isDone()) {
            switch (pathIterator.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                System.out.println("Close path");
                break;
            }
            pathIterator.next();
        }
    }

    double getEffectiveWidth(double dirX, double dirY)
    {
        if (dirX == 0 && dirY == 0)
            return 0;
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        double widthX = dirY;
        double widthY = -dirX;
        double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
        double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
        double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
        return getGraphicsState().getLineWidth() * factor;
    }
}

(PrintPaths.java)

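As we do not want to actually draw the page but merely extract the paths that would be drawn, we have to strip down the PageDrawer as the overrides in PrintPaths do: text, bitmaps, and shadings are ignored, and the fill and stroke methods only report the current path and then reset it.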
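The getEffectiveWidth method in PrintPaths scales the line width from the graphics state by how much the current transformation matrix stretches the direction perpendicular to the segment. Below is a minimal standalone sketch of the same arithmetic without PDFBox. The class name EffectiveWidthSketch, the parameters a, b, c, d (standing in for the matrix values at (0,0), (0,1), (1,0), (1,1)), and the uniform scale of 0.1 are assumptions made for illustration only, not values read from the sample file.

```java
final class EffectiveWidthSketch {

    /**
     * Scales lineWidth by the factor by which the linear part of a
     * transformation [a b; c d] stretches the vector perpendicular
     * to the segment direction (dirX, dirY).
     */
    static double effectiveWidth(double lineWidth, double dirX, double dirY,
                                 double a, double b, double c, double d) {
        if (dirX == 0 && dirY == 0) {
            return 0; // degenerate segment: no direction, no meaningful width
        }
        // Vector perpendicular to the segment direction
        double perpX = dirY;
        double perpY = -dirX;
        // Apply the linear part of the matrix to the perpendicular vector
        double transformedX = perpX * a + perpY * c;
        double transformedY = perpX * b + perpY * d;
        // Ratio of the vector lengths after and before the transformation
        double factor = Math.sqrt(
                (transformedX * transformedX + transformedY * transformedY)
                / (perpX * perpX + perpY * perpY));
        return lineWidth * factor;
    }

    public static void main(String[] args) {
        // Assumed example: a horizontal segment under a uniform scale of 0.1
        double width = effectiveWidth(5.981, 538.8, 0, 0.1, 0, 0, 0.1);
        System.out.println("Effective width: " + width); // roughly 0.5981
    }
}
```

For a uniform scale the factor is the same in every direction; with a non-uniform matrix, the result depends on the orientation of the segment, which is why the method takes the direction as input.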

This sample parser merely prints the path drawing operations to show how to do it. Obviously, you can collect them instead for automated processing...
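As a rough sketch of what collecting instead of printing could look like, the following helper walks a GeneralPath the same way printPath() does, but stores the straight segments in a list. It uses only java.awt.geom; the class name PathSegmentCollector and the method collectLines are hypothetical names introduced here, and skipping curve segments is an assumption that suits table rules but not arbitrary graphics.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

final class PathSegmentCollector {

    /**
     * Returns the straight segments of a path. Curves are skipped,
     * but the current point is still tracked across them.
     */
    static List<Line2D.Double> collectLines(GeneralPath path) {
        List<Line2D.Double> lines = new ArrayList<>();
        double[] c = new double[6];
        double x = 0, y = 0;           // current point
        double startX = 0, startY = 0; // start of the current subpath
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
            case PathIterator.SEG_MOVETO:
                x = startX = c[0];
                y = startY = c[1];
                break;
            case PathIterator.SEG_LINETO:
                lines.add(new Line2D.Double(x, y, c[0], c[1]));
                x = c[0];
                y = c[1];
                break;
            case PathIterator.SEG_QUADTO:
                x = c[2];
                y = c[3];
                break;
            case PathIterator.SEG_CUBICTO:
                x = c[4];
                y = c[5];
                break;
            case PathIterator.SEG_CLOSE:
                // Closing a subpath implies a line back to its start
                if (x != startX || y != startY) {
                    lines.add(new Line2D.Double(x, y, startX, startY));
                }
                x = startX;
                y = startY;
                break;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        GeneralPath path = new GeneralPath();
        path.moveTo(35.9207, 724.649);
        path.lineTo(574.73, 724.649);
        for (Line2D.Double line : collectLines(path)) {
            System.out.printf("(%s %s) -> (%s %s)%n", line.x1, line.y1, line.x2, line.y2);
        }
    }
}
```

In a PageDrawer subclass like PrintPaths, such a helper could be called with getLinePath() inside strokePath() and fillPath() before the path is reset. Note that the collected coordinates are the raw path values; the sample parser applies fixY to the y values only when printing.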

You can use the parser like this:

// resource: the PDF to analyze, e.g. an InputStream or a File
PDDocument document = PDDocument.load(resource);
List<?> allPages = document.getDocumentCatalog().getAllPages();
int i = 7; // page 8

System.out.println("\n\nPage " + (i+1));
PrintPaths printPaths = new PrintPaths();

PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null)
{
    printPaths.processStream(page, page.findResources(), contents.getStream());
}
document.close(); // release the resources held by the document

(ExtractPaths.java)

The output is:

Page 8
Move to (35.92070007324219 724.6490478515625)
Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981

Move to (35.92070007324219 694.4660034179688)
Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981

Move to (292.2610168457031 468.677001953125)
Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43

Move to (348.9360046386719 468.677001953125)
Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43

Move to (405.6090087890625 468.677001953125)
Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43

Move to (462.281982421875 468.677001953125)
Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43

Move to (518.9549560546875 468.677001953125)
Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43

Move to (35.92070007324219 725.447998046875)
Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981

Move to (35.92070007324219 212.5050048828125)
Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981

Quite peculiar: the vertical lines are actually drawn as very short (about 0.6 units), very thick (about 513 units) horizontal lines...
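Because of this, classifying a table rule by the direction of its path segment alone would be misleading. One possible approach, sketched below under assumptions, is to turn each stroked segment into the rectangle it covers and to classify that rectangle instead. The class StrokeToRule and its methods are hypothetical names, the segment is assumed to be axis-parallel, and butt line caps (the PDF default) are assumed, since the cap style of the sample file was not checked here. The numbers are rounded from the output above.

```java
import java.awt.geom.Rectangle2D;

final class StrokeToRule {

    /** Area covered by an axis-parallel segment stroked with the given width (butt caps assumed). */
    static Rectangle2D.Double coveredArea(double x1, double y1, double x2, double y2, double width) {
        double half = width / 2;
        if (y1 == y2) {
            // Horizontal segment: the stroke width extends in the y direction
            return new Rectangle2D.Double(Math.min(x1, x2), y1 - half, Math.abs(x2 - x1), width);
        }
        if (x1 == x2) {
            // Vertical segment: the stroke width extends in the x direction
            return new Rectangle2D.Double(x1 - half, Math.min(y1, y2), width, Math.abs(y2 - y1));
        }
        throw new IllegalArgumentException("segment is not axis-parallel");
    }

    /** Classifies the covered area by its longer side. */
    static String orientation(Rectangle2D area) {
        return area.getWidth() >= area.getHeight() ? "horizontal" : "vertical";
    }

    public static void main(String[] args) {
        // Long, thin stroke from the top of the table
        Rectangle2D.Double top = coveredArea(35.9207, 724.649, 574.73, 724.649, 0.5981);
        // Short, very thick stroke that visually forms a column separator
        Rectangle2D.Double column = coveredArea(292.261, 468.677, 292.859, 468.677, 512.943);

        System.out.println(orientation(top));    // horizontal
        System.out.println(orientation(column)); // vertical
        System.out.println("Column separator spans y from " + column.getMinY() + " to " + column.getMaxY());
    }
}
```

With these inputs the column separator covers a y range of roughly 212 to 725, which lies close to the y values of the bottom and top horizontal lines in the output.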
