Delete an image from a PDF file using PDFBox
Question
I am attempting to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF does not have patterns or forms. The PDF file contains two images; the PDFDebugger tool shows Resources >> XObject >> IM3 and IM5. The problem: when I open the output PDF file, the images have not been deleted.
public class DeleteImage {

    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if (!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            for (COSName name : page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();
            for (int j = 0; j < tokens.size(); j++) {
                Object token = tokens.get(j);
                if (token instanceof Operator) {
                    Operator op = (Operator) token;
                    System.out.println("operation" + op.getName());
                    //find image - remove it
                    if (op.getName().equals("Do")) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size() - 1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not quals Do");
                    }
                }
                newTokens.add(token);
            }
            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());
            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter(out);
            writer.writeTokens(newTokens);
            out.close();
            newPage.setContents(newContents);
        }
        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;
        for (PDPage page : document.getPages()) {
            resources = page.getResources();
            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}
Recommended answer
If you call remove
...
In remove you
- load the PDF into document,
- iterate over the pages of document, and for each page
  - iterate over the XObject resources, and for each XObject
    - check whether it is an image XObject, and if it is
      - call removeImages, which loads the same original file, processes it, and saves the result as "RemoveImage.pdf",
- save the unchanged document as "RemoveImage.pdf".
So in that last step, the final document.save in remove, you overwrite any changes you may have made in removeImages and end up with your original file in "RemoveImage.pdf"!
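The overwrite can be reproduced without PDFBox at all. The following is only a toy model of the control flow, assuming plain text files in place of PDFs and lines starting with "image" in place of image XObjects; the class and method names (OverwriteDemo, editAndSave, buggyRemove, fixedRemove) are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverwriteDemo {

    // Stand-in for removeImages: loads its OWN copy of the source, edits it, saves it.
    static void editAndSave(Path source, Path target) throws IOException {
        List<String> copy = new ArrayList<>(Files.readAllLines(source));
        copy.removeIf(line -> line.startsWith("image"));
        Files.write(target, copy);
    }

    // Stand-in for remove as posted: the edited file is written first,
    // then the untouched in-memory copy is saved over it.
    static void buggyRemove(Path source, Path target) throws IOException {
        List<String> document = Files.readAllLines(source);
        editAndSave(source, target);
        Files.write(target, document);
    }

    // One possible fix: load once, edit that same instance, save once.
    static void fixedRemove(Path source, Path target) throws IOException {
        List<String> document = new ArrayList<>(Files.readAllLines(source));
        document.removeIf(line -> line.startsWith("image"));
        Files.write(target, document);
    }

    // Runs one of the two flows on temporary files and returns the saved result.
    static List<String> run(boolean buggy) {
        try {
            Path source = Files.createTempFile("input", ".txt");
            Path target = Files.createTempFile("RemoveImage", ".txt");
            Files.write(source, Arrays.asList("text A", "image 1", "text B", "image 2"));
            if (buggy) {
                buggyRemove(source, target);
            } else {
                fixedRemove(source, target);
            }
            List<String> result = Files.readAllLines(target);
            Files.delete(source);
            Files.delete(target);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("as posted: " + run(true));  // as posted: [text A, image 1, text B, image 2]
        System.out.println("fixed:     " + run(false)); // fixed:     [text A, text B]
    }
}
```

Translated back to the question, this suggests calling removeImages directly, or doing the edits on the document instance that is saved at the end, rather than loading the file a second time.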
In removeImages you do make some changes, but there are certain issues:

Whenever you find an image XObject resource, you attempt to remove it from the page directly

page.getCOSObject().removeItem(propertyName);

but the image XObject resource is not a direct child of the page; it is managed by pdResources, so you should remove it from there.
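Why removing the name from the page dictionary has no effect can be seen in a small model. This sketch uses nested java.util maps as a stand-in for the PDF dictionaries (page, then Resources, then XObject); it is not PDFBox code, and the class and helper names (ResourceRemovalDemo, newPage, xobjectsOf) are invented for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceRemovalDemo {

    // Toy page dictionary: page -> "Resources" -> "XObject" -> { name : object }.
    static Map<String, Object> newPage() {
        Map<String, Object> xobjects = new LinkedHashMap<>();
        xobjects.put("IM3", "image");
        xobjects.put("IM5", "image");

        Map<String, Object> resources = new LinkedHashMap<>();
        resources.put("XObject", xobjects);

        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Type", "Page");
        page.put("Resources", resources);
        return page;
    }

    // Walks down to the nested XObject dictionary of the toy page.
    @SuppressWarnings("unchecked")
    static Map<String, Object> xobjectsOf(Map<String, Object> page) {
        Map<String, Object> resources = (Map<String, Object>) page.get("Resources");
        return (Map<String, Object>) resources.get("XObject");
    }

    public static void main(String[] args) {
        Map<String, Object> page = newPage();

        // As posted: the page itself has no "IM3" key, so nothing is removed.
        page.remove("IM3");
        System.out.println(xobjectsOf(page).keySet()); // [IM3, IM5]

        // Removing from the nested XObject dictionary is what takes effect.
        xobjectsOf(page).remove("IM3");
        System.out.println(xobjectsOf(page).keySet()); // [IM5]
    }
}
```

In PDFBox 2.x the corresponding step would presumably be to fetch the dictionary stored under COSName.XOBJECT from pdResources.getCOSObject() and call removeItem on that dictionary; treat this as an assumption to check against the PDFBox version in use. It is probably also safer to collect the names first and remove them after the iteration over getXObjectNames() has finished.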
You also remove every Do instruction from the page content, not only those that paint image XObjects, so you probably remove more than you wanted.
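A more selective filter could look at the operand in front of each Do and drop the pair only when that name refers to an image. The sketch below is library-free: the token list is a plain List<Object>, and the two predicates are hypothetical hooks that decide what counts as a Do operator and as an image name. DoFilterDemo and dropImageDraws are names made up here, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class DoFilterDemo {

    /**
     * Copies the tokens, skipping a Do operator together with its name operand,
     * but only when the operand directly in front of it is an image name.
     */
    static List<Object> dropImageDraws(List<Object> tokens,
                                       Predicate<Object> isDoOperator,
                                       Predicate<Object> isImageName) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            if (isDoOperator.test(token)
                    && !kept.isEmpty()
                    && isImageName.test(kept.get(kept.size() - 1))) {
                kept.remove(kept.size() - 1); // drop the image name operand
                continue;                     // and drop the Do operator itself
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy content stream: strings starting with "/" are names, the rest are operators.
        List<Object> tokens = Arrays.<Object>asList("q", "/IM3", "Do", "Q", "/Fm1", "Do");
        List<String> imageNames = Arrays.asList("/IM3", "/IM5");

        List<Object> filtered = dropImageDraws(
                tokens,
                t -> "Do".equals(t),
                t -> imageNames.contains(t));

        System.out.println(filtered); // [q, Q, /Fm1, Do]
    }
}
```

With PDFBox 2.x the token list would come from PDFStreamParser.getTokens(), the first predicate would test for an Operator whose getName() is "Do", and the second for a COSName for which pdResources.isImageXObject(...) returns true, decided before the resource entries are removed. Note also that the posted code attaches the rewritten stream to newPage in newDoc while it is document that gets saved, so the filtered tokens presumably need to be written back to the original page.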