How to determine the location of actual PDF content with PDFBox?

Problem description

We're printing some PDFs from a Java desktop app, using PDFBox, and the PDFs contain too much whitespace (fixing the PDF generator is unfortunately not an option).

The problem I have is determining where the actual content on the page is, because the crop/media/trim/art/bleed boxes are useless. Is there some easy and efficient way to do so, better/faster than rendering the page to an image and examining which pixels stayed white?

Solution

As you have mentioned in a comment that

it can be assumed that there is no background or other elements that would need special handling,

I'll show a basic solution without any such special handling.

A basic bounding box finder

To find the bounding box without actually rendering to a bitmap and inspecting the bitmap pixels, one has to scan all the instructions of the content streams of the page and any XObjects referenced from there. One determines the bounding boxes of the stuff drawn by each instruction and eventually combines them to a single box.

The simple box finder presented here combines them by simply returning the bounding box of their union.
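The combination step can be sketched on its own, independent of PDFBox. The following is only an illustration using plain `java.awt.geom` types; the class name `BoxUnionSketch` and the sample coordinates are made up for this example and are not part of the answer's code.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: combine per-instruction bounding boxes into one box. */
class BoxUnionSketch {

    /** Returns the smallest rectangle containing all boxes, or null if there are none. */
    static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                result = new Rectangle2D.Double();
                result.setRect(box);   // copy, so the input is not modified
            } else {
                result.add(box);       // grow to include the next box
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical boxes for a line of text and an image on the same page
        Rectangle2D text = new Rectangle2D.Double(72, 700, 200, 12);
        Rectangle2D image = new Rectangle2D.Double(100, 400, 150, 100);
        System.out.println(union(Arrays.asList(text, image)));
    }
}
```

Returning null for an empty page mirrors what the finder does when nothing is drawn, which is why callers should check the result before using it.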

For scanning the instructions of content streams, PDFBox offers a number of classes based on the PDFStreamEngine. The simple box finder is derived from the PDFGraphicsStreamEngine, which extends the PDFStreamEngine with some methods related to vector graphics.

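import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.Arrays;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class BoundingBoxFinder extends PDFGraphicsStreamEngine {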
    public BoundingBoxFinder(PDPage page) {
        super(page);
    }

    public Rectangle2D getBoundingBox() {
        return rectangle;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            add(rect);
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                add(ctm.transformPoint(x, y));
            }
        }
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        addToPath(p0, p1, p2, p3);
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        addToPath(x1, y1);
        addToPath(x2, y2);
        addToPath(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return null;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        rectanglePath = null;
    }

    @Override
    public void strokePath() throws IOException {
        addPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
    }

    void addToPath(Point2D... points) {
        Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
    }

    void addToPath(double newx, double newy) {
        if (rectanglePath == null) {
            rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectanglePath.add(newx, newy);
        }
    }

    void addPath() {
        if (rectanglePath != null) {
            add(rectanglePath);
            rectanglePath = null;
        }
    }

    void add(Rectangle2D rect) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double();
            rectangle.setRect(rect);
        } else {
            rectangle.add(rect);
        }
    }

    void add(Point2D... points) {
        for (Point2D point : points) {
            add(point.getX(), point.getY());
        }
    }

    void add(double newx, double newy) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectangle.add(newx, newy);
        }
    }

    Rectangle2D rectanglePath = null;
    Rectangle2D rectangle = null;
}

(BoundingBoxFinder on github)

As you can see I borrowed the calculateGlyphBounds helper method from a PDFBox example class.
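The drawImage override above relies on the fact that PDF paints an image into the unit square of the current user space, so the bounds of the four corners (0,0), (0,1), (1,0), (1,1) after applying the current transformation matrix give the image's box. Here is a small sketch of just that calculation. It uses `java.awt.geom.AffineTransform` as a stand-in for the PDFBox `Matrix`, and the class name `ImageBoundsSketch` and the sample matrix are assumptions made for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Minimal sketch: bounds of the unit square after applying a transformation matrix. */
class ImageBoundsSketch {

    static Rectangle2D imageBounds(AffineTransform ctm) {
        Rectangle2D bounds = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                // Transform one corner of the unit square into page coordinates
                Point2D corner = ctm.transform(new Point2D.Double(x, y), null);
                if (bounds == null) {
                    bounds = new Rectangle2D.Double(corner.getX(), corner.getY(), 0, 0);
                } else {
                    bounds.add(corner);
                }
            }
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Hypothetical matrix "200 0 0 100 50 600 cm": scale to 200x100, move to (50, 600)
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 600);
        System.out.println(imageBounds(ctm));
    }
}
```

All four corners are needed rather than just two opposite ones, because a rotated or skewed image can have any of its corners at the extremes.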

A usage example

You can use the BoundingBoxFinder like this to draw a border line along the bounding box rim for a given PDPage pdPage of a PDDocument pdDocument:

void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
    BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    Rectangle2D box = boxFinder.getBoundingBox();
    if (box != null) {
        try (   PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
            canvas.setStrokingColor(Color.magenta);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
        }
    }
}

(DetermineBoundingBox helper method)

The result looks like this:

Only a proof-of-concept

Beware, the BoundingBoxFinder really is not very sophisticated. In particular, it does not ignore invisible content such as a white background rectangle, text drawn in rendering mode "invisible", arbitrary content covered by a white filled path, or the white parts of bitmap images. Furthermore, it does not take into account clip paths, unusual blend modes, annotations, and so on.

Extending the class to handle those cases properly is fairly straightforward, but the total amount of code to add would exceed the scope of a Stack Overflow answer.
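As a rough idea of what one such extension might look like, the sketch below restricts each drawn box to a clip area before adding it to the result. It is an illustration under simplifying assumptions, not part of the answer's code: the clip area is approximated by a rectangle, the class name `ClipAwareBoxSketch` is made up, and how the clip bounds would be obtained from PDFBox's graphics state is deliberately left out.

```java
import java.awt.geom.Rectangle2D;

/** Minimal sketch: only the part of a drawn box inside the clip area counts. */
class ClipAwareBoxSketch {
    private Rectangle2D box = null;

    /** Adds the visible part of 'drawn'; ignores it entirely if it lies outside the clip. */
    void add(Rectangle2D drawn, Rectangle2D clipBounds) {
        Rectangle2D visible = drawn.createIntersection(clipBounds);
        // A negative size means the rectangles do not overlap at all.
        // Zero sizes are kept on purpose: a horizontal line has height 0.
        if (visible.getWidth() < 0 || visible.getHeight() < 0) {
            return;
        }
        if (box == null) {
            box = new Rectangle2D.Double();
            box.setRect(visible);
        } else {
            box.add(visible);
        }
    }

    Rectangle2D getBoundingBox() {
        return box;
    }

    public static void main(String[] args) {
        ClipAwareBoxSketch finder = new ClipAwareBoxSketch();
        Rectangle2D clip = new Rectangle2D.Double(50, 50, 200, 200);
        finder.add(new Rectangle2D.Double(0, 0, 100, 100), clip);    // partly visible
        finder.add(new Rectangle2D.Double(500, 500, 10, 10), clip);  // fully clipped away
        System.out.println(finder.getBoundingBox());
    }
}
```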


For the code in this answer I used the current PDFBox 3.0.0-SNAPSHOT development branch but it should also work out of the box for current 2.x versions.
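Coming back to the printing use case from the question, the box found this way could serve as the basis for a tighter crop area. The sketch below only does the geometry: it pads the content box by a margin and keeps it within the page. The class name `CropMarginSketch`, the margin value, and the page size are assumptions for illustration. With PDFBox the result could then be applied with something like `pdPage.setCropBox(new PDRectangle(x, y, width, height))`, which is not shown here.

```java
import java.awt.geom.Rectangle2D;

/** Minimal sketch: pad a content box by a margin and keep it inside the page. */
class CropMarginSketch {

    static Rectangle2D paddedCrop(Rectangle2D content, Rectangle2D page, double margin) {
        Rectangle2D padded = new Rectangle2D.Double(
                content.getX() - margin,
                content.getY() - margin,
                content.getWidth() + 2 * margin,
                content.getHeight() + 2 * margin);
        // Never extend beyond the page itself
        return padded.createIntersection(page);
    }

    public static void main(String[] args) {
        Rectangle2D page = new Rectangle2D.Double(0, 0, 612, 792);       // US Letter in points
        Rectangle2D content = new Rectangle2D.Double(100, 200, 300, 400); // hypothetical finder result
        System.out.println(paddedCrop(content, page, 10));
    }
}
```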
