How can I check whether a PDF page is a (scanned) image using PDFBox or Xpdf?


Question

PDFBox problem when extracting images: how can I check whether a PDF page is an image and extract it with the PDFBox library? There is a method to get images, but if the whole PDF page is an image, it does not return it. Could someone help me solve this problem?

Xpdf problem when extracting images: I tried to extract the images with another library, Xpdf, but it strangely flips the page if the page is an image. If the PDF contains a small image as an image object, the result is fine; if the page is scanned, the exported image comes out flipped.

I want to extract all images from the PDF: if a page is scanned, I want to get it as an image, and if a page contains plain text as well as images, I want to get the images from that page too.

My point is to extract all images from the PDF, not only images placed on a page. If a page itself is an image, it should be extracted as an image too and not skipped, which is what I think PDFBox is doing.

Xpdf does extract something, but when it exports a scanned page, the result is flipped (top, right).

How can I solve this problem? Thanks.

Download a sample file to test with.

    PDDocument document = PDDocument.load(new File("/home/dru/IdeaProjects2/PDFExtractor/test/t1.pdf"));
    PDPageTree list = document.getPages();

    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        System.out.println(pdResources.getResourceCache());

        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);

            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("/home/dru/IdeaProjects2/PDFExtractor/test/out/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject)o).getImage(), "png", file);
            }
        }
    }

Recommended answer

Extract images properly

As the updated PDF makes clear, the problem is that it does not have any images directly on the page; instead, form XObjects that contain the images are drawn onto it. Thus, the image search has to recurse into the form XObjects.

And that is not all: all pages in the updated PDF share the same resources dictionary; each page merely picks a different one of its form XObjects to display. Thus, one really has to parse the respective page content stream to determine which XObject (with which images) is present on a given page.

Actually, this is what the PDFBox tool ExtractImages does. Unfortunately, though, it does not report the page on which it found a given image; cf. the ExtractImages.java test method testExtractPageImagesTool10948New.

But we can simply borrow the technique used by that tool:

PDDocument document = PDDocument.load(resource);
int page = 1;
for (final PDPage pdPage : document.getPages())
{
    final int currentPage = page;
    PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
    {
        int index = 0;
        
        @Override
        public void drawImage(PDImage pdImage) throws IOException
        {
            if (pdImage instanceof PDImageXObject)
            {
                PDImageXObject image = (PDImageXObject)pdImage;
                File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
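                // Close the stream after writing so the file handle is not leaked.
                try (FileOutputStream out = new FileOutputStream(file))
                {
                    ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
                }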
                index++;
            }
        }

        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

        @Override
        public void clip(int windingRule) throws IOException { }

        @Override
        public void moveTo(float x, float y) throws IOException {  }

        @Override
        public void lineTo(float x, float y) throws IOException { }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {  }

        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }

        @Override
        public void closePath() throws IOException { }

        @Override
        public void endPath() throws IOException { }

        @Override
        public void strokePath() throws IOException { }

        @Override
        public void fillPath(int windingRule) throws IOException { }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }

        @Override
        public void shadingFill(COSName shadingName) throws IOException { }
    };
    pdfGraphicsStreamEngine.processPage(pdPage);
    page++;
}

(ExtractImages.java test method testExtractPageImages10948New)

This code outputs images with the file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.

PS: Remember to include com.github.jai-imageio:jai-imageio-core on your classpath; it is required for TIFF output.
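Whether that extra dependency is needed can be checked at runtime. The following is a minimal sketch using only the standard javax.imageio API; the class name ImageFormatCheck and the PNG fallback policy are assumptions made for illustration and are not part of the original answer. Note that JDK 9 and later ship a TIFF plugin of their own, so the check may already succeed there without jai-imageio-core.

```java
import javax.imageio.ImageIO;

// Hypothetical helper: checks which image formats the current runtime can write.
class ImageFormatCheck {
    /** Returns true if a registered ImageIO plugin can write files with this suffix. */
    static boolean canWrite(String suffix) {
        return ImageIO.getImageWritersBySuffix(suffix).hasNext();
    }

    /** Returns the preferred suffix if it is writable, otherwise falls back to "png". */
    static String writableSuffix(String preferred) {
        return canWrite(preferred) ? preferred : "png";
    }

    public static void main(String[] args) {
        System.out.println("tiff writable: " + canWrite("tiff"));
        System.out.println("suffix to use: " + writableSuffix("tiff"));
    }
}
```

In the extraction code, the value of image.getSuffix() could be passed through such a check before it is used as the output format.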

Another issue the OP had was that the images sometimes appear flipped upside-down, e.g. in the case of his newest sample file, "t1_edited.pdf". The reason is that those images are indeed stored upside-down as image resources in the PDF.

When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors the drawn image vertically and so creates the expected appearance.
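To make the mirroring concrete: PDF draws every image into the unit square, and the current transformation matrix maps that square onto the page. The sketch below uses only java.awt.geom; the class name CtmDemo and the matrix values (an A4-sized page, 595 x 842 points) are hypothetical and chosen purely for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo: shows how a CTM with a negative y scale mirrors an image.
class CtmDemo {
    /** Maps a point of the image's unit square to page coordinates. */
    static Point2D map(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    /** "h" if x is mirrored, "v" if y is mirrored (meaningful for unrotated CTMs only). */
    static String flips(AffineTransform ctm) {
        String flips = "";
        if (ctm.getScaleX() < 0) flips += "h";
        if (ctm.getScaleY() < 0) flips += "v";
        return flips;
    }

    public static void main(String[] args) {
        // The constructor order (m00, m10, m01, m11, m02, m12) matches the
        // PDF matrix order a b c d e f. Here d is negative (vertical mirror)
        // and f shifts the mirrored image back up onto the page.
        AffineTransform ctm = new AffineTransform(595, 0, 0, -842, 0, 842);

        // The bottom edge of the unit square (y = 0) lands at page y = 842,
        // the top edge (y = 1) at page y = 0: the image is drawn upside-down.
        System.out.println("bottom edge -> page y = " + map(ctm, 0, 0).getY());
        System.out.println("top edge    -> page y = " + map(ctm, 0, 1).getY());
        System.out.println("flips: " + flips(ctm));
    }
}
```

Checking only the sign of the scale components is a simplification: with rotated pages or rotated images, the scale entries alone do not describe the orientation.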

By slightly enhancing the drawImage implementation in the code above, one can include indicators of such flips in the names of the exported images:

public void drawImage(PDImage pdImage) throws IOException
{
    if (pdImage instanceof PDImageXObject)
    {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        String flips = "";
        if (ctm.getScaleX() < 0)
            flips += "h";
        if (ctm.getScaleY() < 0)
            flips += "v";
        if (flips.length() > 0)
            flips = "-" + flips;
        PDImageXObject image = (PDImageXObject)pdImage;
        File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
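        // Close the stream after writing so the file handle is not leaked.
        try (FileOutputStream out = new FileOutputStream(file))
        {
            ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), out);
        }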
        index++;
    }
}

Now vertically or horizontally flipped images are marked accordingly.
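If the exported file should look the way the image appears on the page, the flip could also be undone before writing. The following is a sketch under stated assumptions, not part of the original answer: the class FlipFix and its unflip method are hypothetical, it uses only java.awt.image, and it copies pixel by pixel for clarity instead of speed.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: mirrors an image according to the "h"/"v" flip flags.
class FlipFix {
    /** Returns src unchanged if no flag is set, otherwise a mirrored copy. */
    static BufferedImage unflip(BufferedImage src, String flips) {
        boolean h = flips.contains("h");
        boolean v = flips.contains("v");
        if (!h && !v) {
            return src;
        }
        int width = src.getWidth();
        int height = src.getHeight();
        // The copy is plain RGB or ARGB; the original color model is not preserved.
        int type = src.getColorModel().hasAlpha()
                ? BufferedImage.TYPE_INT_ARGB : BufferedImage.TYPE_INT_RGB;
        BufferedImage out = new BufferedImage(width, height, type);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int targetX = h ? width - 1 - x : x;
                int targetY = v ? height - 1 - y : y;
                out.setRGB(targetX, targetY, src.getRGB(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF0000); // red pixel in the top-left corner
        BufferedImage fixed = unflip(img, "-v");
        // After a vertical flip the red pixel sits in the bottom-left corner.
        System.out.println(Integer.toHexString(fixed.getRGB(0, 1) & 0xFFFFFF));
    }
}
```

In the drawImage method above, the result of image.getImage() could be passed through such a helper together with the flips string before it is handed to ImageIOUtil.writeImage.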
