How to recognize a PDF watermark and remove it using PDFBox

Question

I'm trying to extract text, excluding the watermark text, from PDF files with the Apache PDFBox library. I want to remove the watermark first so that what remains is the text I want. Unfortunately, neither PDMetadata nor PDXObject can recognize the watermark. Any help will be appreciated. I found the code below.

    // Imports required by this snippet (PDFBox 1.8.x API)
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.common.PDMetadata;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

    // convertStreamToString(...) and getUniqueFileName(...) are helper methods
    // that are not shown in the question.

    // Open the PDF document
    PDDocument document = null;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
        // Get all pages and loop through them
        for (Object o : document.getDocumentCatalog().getAllPages()) {
            PDPage page = (PDPage) o;
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // page has no resources of its own
            }
            // Get all images on the page
            Map<String, PDXObjectImage> images = resources.getImages(); // How to specify watermark instead of images??
            if (images == null) {
                continue;
            }
            // Check all images for metadata
            for (Map.Entry<String, PDXObjectImage> entry : images.entrySet()) {
                String key = entry.getKey();
                PDXObjectImage image = entry.getValue();
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: analyzing for metadata");
                if (metadata == null) {
                    System.out.println("No metadata found for this image.");
                } else {
                    // Print the XMP metadata stream and close it afterwards
                    try (InputStream xmlInputStream = metadata.createInputStream()) {
                        System.out.println("--------------------------------------------------------------------------------");
                        System.out.println(convertStreamToString(xmlInputStream));
                    }
                }
                // Export the image
                String name = getUniqueFileName(key, image.getSuffix());
                System.out.println("Writing image: " + name);
                image.write2file(name);
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Always release the document, even if processing failed
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Recommended answer

Contrary to your assumption, a PDF contains no explicit watermark object by which watermarks could be recognized in generic PDFs.

Watermarks can be applied to a PDF page in many ways. Each PDF-creating library or application has its own way of adding watermarks, and some even offer several.

A watermark can be

  1. anything (Bitmap graphics, vector graphics, text, ...) drawn early in the content and, therefore, forming a background on which the rest of the content is drawn;
  2. anything (Bitmap graphics, vector graphics, text, ...) drawn late in the content with transparency, forming a transparent overlay;
  3. anything (Bitmap graphics, vector graphics, text, ...) drawn in the content stream of a watermark annotation which shall be used to represent graphics that shall be printed at a fixed size and position on a page, regardless of the dimensions of the printed page (cf. section 12.5.6.22 of the PDF specification ISO 32000-1).

Sometimes even mixed forms are used; have a look at this answer for an example: at the bottom you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).

The last option (the watermark annotation) is obviously easy to remove, but it is also the least often used one, most likely precisely because it is so easy to remove; people applying watermarks generally don't want their watermarks to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code that copies page content often ignores annotations.

On the other hand, if you handle not generic documents but a specific type of document (all generated alike), the very manner in which the watermarks are applied in them can probably be recognized, and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.
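
For the specific-document case, one possible heuristic can be sketched without touching the PDF structure at all. The sketch below is an illustration only, not a general watermark detector. It assumes that the text of each page has already been extracted into one string per page (for example by running PDFBox's PDFTextStripper page by page) and that the watermark shows up as a line of text repeated identically on every page. The class name RepeatedLineFilter and the method stripRepeatedLines are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RepeatedLineFilter {

    /**
     * Returns a copy of the page texts without the lines that occur on every page.
     * Assumption (not guaranteed for real PDFs): the watermark is extracted as a
     * separate text line that is identical on each page.
     */
    static List<String> stripRepeatedLines(List<String> pageTexts) {
        // With fewer than two pages nothing can be recognized as "repeated".
        if (pageTexts.size() < 2) {
            return new ArrayList<>(pageTexts);
        }
        // Count on how many pages each distinct (trimmed, non-empty) line occurs.
        Map<String, Integer> pagesContaining = new HashMap<>();
        for (String text : pageTexts) {
            Set<String> seenOnThisPage = new HashSet<>();
            for (String line : text.split("\\R")) {
                String key = line.trim();
                if (!key.isEmpty() && seenOnThisPage.add(key)) {
                    pagesContaining.merge(key, 1, Integer::sum);
                }
            }
        }
        // Rebuild every page, skipping the lines that were found on all pages.
        List<String> cleaned = new ArrayList<>();
        for (String text : pageTexts) {
            List<String> kept = new ArrayList<>();
            for (String line : text.split("\\R")) {
                Integer count = pagesContaining.get(line.trim());
                boolean onEveryPage = count != null && count == pageTexts.size();
                if (!onEveryPage) {
                    kept.add(line);
                }
            }
            cleaned.add(String.join("\n", kept));
        }
        return cleaned;
    }
}
```

Called with the two page texts "CONFIDENTIAL\nFirst page body" and "Second page body\nCONFIDENTIAL", the method returns the two strings "First page body" and "Second page body". Legitimate running headers or footers that repeat on every page would be removed as well, so the result still needs checking against a sample document.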
