Delete an image from a PDF file using PDFBox

Question

I am attempting to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF does not have patterns or forms. The PDF file contains 2 images. The PDFDebugger tool shows Resources >> XObject >> IM3 and IM5. The problem: when I open the output PDF file, the images have not been deleted.

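import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;

public class DeleteImage {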
    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));

        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if(!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });

            for (COSName name :  page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }
              
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();

            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );
                if( token instanceof Operator ) {
                    Operator op = (Operator)token;

                    System.out.println("operation" + op.getName());
                    //find image - remove it
                    if( op.getName().equals("Do") ) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size()-1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not quals Do");
                    }
                }
                newTokens.add(token);
            }

            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());

            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter( out );
            writer.writeTokens( newTokens);
            out.close();
            newPage.setContents( newContents );
        }

        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;
        
        for (PDPage page : document.getPages()) {
            resources = page.getResources();

            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                
                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}

Recommended answer

If you call remove...

In remove you

  • load the PDF into document,
  • iterate over the pages of document, and for each page
    • iterate over the XObject resources, and for each XObject
      • check whether it is an image XObject, and if it is
        • call removeImages, which loads the same original file, processes it, and saves the result as "RemoveImage.pdf",
  • save the unchanged PDF in document to "RemoveImage.pdf".

So in that last step you overwrite any changes removeImages may have made and end up with your original file in "RemoveImage.pdf"!

In removeImages you do make some changes, but there are certain issues:

  • Whenever you find an image XObject resource, you attempt to remove it from the page directly

      page.getCOSObject().removeItem(propertyName);

    but the image XObject resource is not a direct child of the page; it is managed by pdResources, so you should remove it from there.

  • You remove all Do instructions from the page content, not only those for image XObjects, so you probably remove more than you wanted.
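
The second point can be sketched as a small filter that drops a Do operator, together with its name operand, only when that name refers to an image. This is a minimal sketch, not the asker's code or a PDFBox API: the class ImageDoFilter, the method dropImageDraws, and the two predicate hooks are hypothetical names introduced for illustration. So that it runs without PDFBox on the classpath, the demo models tokens as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ImageDoFilter {

    /**
     * Returns a copy of the tokens without the "name Do" pairs whose name
     * refers to an image. Both predicates are hypothetical hooks supplied
     * by the caller; they decide what counts as a Do operator and what
     * counts as an image name.
     */
    public static List<Object> dropImageDraws(List<Object> tokens,
                                              Predicate<Object> isDo,
                                              Predicate<Object> isImageName) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            if (isDo.test(token) && !kept.isEmpty()
                    && isImageName.test(kept.get(kept.size() - 1))) {
                // The operand of Do is the token just before it: drop both.
                kept.remove(kept.size() - 1);
                continue;
            }
            // Everything else, including Do on non-image XObjects, is kept.
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in tokens: plain strings instead of PDFBox objects (assumption).
        List<Object> tokens =
                List.of("q", "/Im3", "Do", "Q", "/Fm1", "Do", "/Im5", "Do");
        List<Object> result = dropImageDraws(tokens,
                t -> "Do".equals(t),
                t -> "/Im3".equals(t) || "/Im5".equals(t));
        System.out.println(result); // prints [q, Q, /Fm1, Do]
    }
}
```

With the PDFBox 2.x classes already used in the question, the hooks could be wired roughly like this (not tested here), where resources is the page's PDResources:

    List<Object> newTokens = ImageDoFilter.dropImageDraws(parser.getTokens(),
        t -> t instanceof Operator && "Do".equals(((Operator) t).getName()),
        t -> t instanceof COSName && resources.isImageXObject((COSName) t));

For the change to reach the saved file, the filtered tokens would have to be written to a PDStream created for the original document and set with setContents on the original page. The question's code sets them on a page imported into a temporary PDDocument, which is never saved. For the first point, the image entries sit in the XObject dictionary of the resources, so one option is to collect the image names first and then remove each one there:

    ((COSDictionary) resources.getCOSObject().getDictionaryObject(COSName.XOBJECT)).removeItem(name);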
