PDFBox - 线/矩形提取 [英] PDFBox - Line / Rectangle extraction

查看:57
本文介绍了PDFBox - 线/矩形提取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从 PDF 中提取文本坐标和线(或矩形)坐标.

I am trying to extract text coordinates and line (or rectangle) coordinates from a PDF.

TextPosition 类具有 getXDirAdj()getYDirAdj() 方法,它们根据文本片段的方向将坐标转换为各自的 TextPosition对象代表(根据@mkl 的评论更正)无论页面旋转如何,最终输出都是一致的.

The TextPosition class has getXDirAdj() and getYDirAdj() methods which transform coordinates according to the direction of the text piece the respective TextPosition object represents (Corrected based on comment from @mkl) The final output is consistent, irrespective of the page rotation.

输出所需的坐标为 X0,Y0(页面左上角)

The coordinates needed on the output are X0,Y0 (TOP LEFT CORNER OF THE PAGE)

这是对@Tilman Hausherr 的解决方案的轻微修改.将 y 坐标反转(高度 - y)以使其与文本提取过程中的坐标保持一致,并将输出写入 csv.

This is a slight modification from the solution by @Tilman Hausherr. The y coordinates are inverted (height - y) to keep it consistent with the coordinates from the text extraction process, also the output is written to a csv.

    public class LineCatcher extends PDFGraphicsStreamEngine
{
    private static final GeneralPath linePath = new GeneralPath();
    private static ArrayList<Rectangle2D> rectList= new ArrayList<Rectangle2D>();
    private int clipWindingRule = -1;
    private static String headerRecord = "Text|Page|x|y|width|height|space|font";

    public LineCatcher(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        if( args.length != 4 )
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            FileOutputStream fop = null;
            File file;
            Writer osw = null;
            int numPages;
            double page_height;
            try
            {
                document = PDDocument.load( new File(args[0], args[1]) );
                numPages = document.getNumberOfPages();
                file = new File(args[2], args[3]);
                fop = new FileOutputStream(file);

                // if file doesnt exists, then create it
                if (!file.exists()) {
                    file.createNewFile();
                }

                osw = new OutputStreamWriter(fop, "UTF8");
                osw.write(headerRecord + System.lineSeparator());
                System.out.println("Line Processing numPages:" + numPages);
                for (int n = 0; n < numPages; n++) {
                    System.out.println("Line Processing page:" + n);
                    rectList = new ArrayList<Rectangle2D>();
                    PDPage page = document.getPage(n);
                    page_height = page.getCropBox().getUpperRightY();
                    LineCatcher lineCatcher = new LineCatcher(page);
                    lineCatcher.processPage(page);

                    try{
                        for(Rectangle2D rect:rectList) {

                            String pageNum = Integer.toString(n + 1);
                            String x = Double.toString(rect.getX());
                            String y = Double.toString(page_height - rect.getY()) ;
                            String w = Double.toString(rect.getWidth());
                            String h = Double.toString(rect.getHeight());
                            writeToFile(pageNum, x, y, w, h, osw);

                        }
                        rectList = null;
                        page = null;
                        lineCatcher = null;
                    }
                    catch(IOException io){
                        throw new IOException("Failed to Parse document for line processing. Incorrect document format. Page:" + n);
                    }
                };

            }
            catch(IOException io){
                throw new IOException("Failed to Parse document for line processing. Incorrect document format.");
            }
            finally
            {
                if ( osw != null ){
                    osw.close();
                }
                if( document != null )
                {
                    document.close();
                }
            }
        }
    }

    private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
        String c = "^" + "|" +
                pageNum + "|" +
                x + "|" +
                y + "|" +
                w + "|" +
                h + "|" +
                "999" + "|" +
                "marker-only";
        osw.write(c + System.lineSeparator());
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());

        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;

    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();

    }

    @Override
    public void strokePath() throws IOException
    {
        rectList.add(linePath.getBounds2D());
        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-pdf>"  + " <output-file>");
    }
}

使用 PDFGraphicsStreamEngine 类来提取线和矩形坐标.线条和矩形的坐标与文本的坐标不对齐

Was using the PDFGraphicsStreamEngine class to extract Line and Rectangle coordinates. The coordinates of lines and rectangles do not align with the coordinates of the text

绿色:文本红色:按原样获得的线坐标黑色:预期坐标(对输出应用变换后获得)

Green: Text Red: Line coordinates obtained as is Black: Expected coordinates (Obtained after applying transformation on the output)

在运行线提取之前尝试了 setRotation() 方法来纠正旋转.然而结果并不一致.

Tried the setRotation() method to correct for the rotation before running the line extract. However the results are not consistent.

使用 PDFBox 获得旋转并获得一致的线/矩形坐标输出的可能选项是什么?

What are the possible options to get the rotation and get a consistent output of the Line / Rectangle coordinates using PDFBox?

推荐答案

据我了解这里的需求,OP工作在一个坐标系中,原点在可见页面的左上角(取页面旋转考虑),x 坐标向右增加,y 坐标向下增加,单位为 PDF 默认用户空间单位(通常为 1/72 英寸).

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually 1/72 inch).

在这个坐标系中,他需要以

In this coordinate system he needs to extract (horizontal or vertical) lines in the form of

  • 左/上端点的坐标和
  • 宽度/高度.

另一方面,他从 Tilman 获得的辅助类 LineCatcher 没有考虑页面旋转.此外,它返回垂直线的底部端点,而不是顶部端点.因此,必须对 LineCatcher 结果应用坐标变换.

The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Furthermore, it returns the bottom end point for vertical lines, not the top end point. Thus, a coordinate transformation has to be applied to of the LineCatcher results.

为此只需替换

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x = Double.toString(rect.getX());
    String y = Double.toString(page_height - rect.getY()) ;
    String w = Double.toString(rect.getWidth());
    String h = Double.toString(rect.getHeight());
    writeToFile(pageNum, x, y, w, h, osw);
}

int pageRotation = page.getRotation();
PDRectangle pageCropBox = page.getCropBox();

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x, y, w, h;
    switch(pageRotation) {
    case 0:
        x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        y = Double.toString(pageCropBox.getUpperRightY() - rect.getY() + rect.getHeight());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 90:
        x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    case 180:
        x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 270:
        x = Double.toString(pageCropBox.getUpperRightY() - rect.getY() + rect.getHeight());
        y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    default:
        throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, page));
    }
    writeToFile(pageNum, x, y, w, h, osw);
}

(ExtractLinesWithDir 测试 testExtractLineRotationTestWithDir)

OP 通过引用TextPosition 类方法getXDirAdj()getYDirAdj() 来描述坐标.实际上,这些方法返回坐标系中的坐标,原点位于页面左上角,y 坐标在旋转页面后向下增加,从而使文本垂直绘制.

The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left page corner and y coordinates increasing downwards after rotating the page so that the text is drawn upright.

在示例文档的情况下,所有文本都在应用页面旋转后绘制为直立.由此得出我对上面写的要求的理解.

In case of the example document all the text is drawn so that it is upright after applying the page rotation. From this my understanding of the requirement written at the top has been derived.

使用 TextPosition.get?DirAdj() 值作为全局坐标的问题是,在具有不同方向绘制文本的页面的文档中,收集的文本坐标突然相对于不同的坐标系.因此,通用解决方案不应该像那样疯狂地收集坐标.相反,它应该首先确定页面方向(例如,页面旋转给出的方向或大多数文本共享的方向)并使用该方向给出的固定坐标系中的坐标加上文本书写方向的指示有问题的作品.

The problem with using the TextPosition.get?DirAdj() values as coordinates globally, though, is that in documents with pages with text drawn in different directions, the collected text coordinates suddenly are relative to different coordinate systems. Thus, a general solution should not collect coordinates wildly like that. Instead it should determine a page orientation at first (e.g. the orientation given by the page rotation or the orientation shared by most of the text) and use coordinates in the fixed coordinate system given by that orientation plus an indication of the writing direction of the text piece in question.

这篇关于PDFBox - 线/矩形提取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆