How to detect if a PDF page has an image in it

Question

I am working with PDF/A-style PDF documents that contain scanned, full-page-size images, followed by a page or two of text placed in a ColumnText object.

Using Java, how do I detect which pages have an image?

The intent of detecting which pages contain images or text is to determine where the first page with text appears. I need to either edit the text or replace the text page(s) with updated text. The pages with images would remain untouched.

I'm using iText 5 and don't currently have the option of upgrading to iText 7.

Here's what I implemented, based on the solution provided by @mkl:

ImageDetector.java

package org.test.pdf;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

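public class ImageDetector implements RenderListener {
    // Set while a page's content is parsed; use a fresh instance per page.
    boolean textFound = false;
    boolean imageFound = false;

    public void beginTextBlock() { }
    public void endTextBlock() { }

    // Called for each chunk of text drawn by the page content
    public void renderText(TextRenderInfo renderInfo) {
        textFound = true;
    }

    // Called for each image drawn by the page content
    public void renderImage(ImageRenderInfo renderInfo) {
        imageFound = true;
    }
}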

PdfDocumentServiceTest.java

package org.test.pdf;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.test.PdfService;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.transaction.annotation.Transactional;

@ActiveProfiles({"local", "testing"})
@DirtiesContext
@Transactional
@RunWith(SpringRunner.class)
@SpringBootTest
public class PdfDocumentServiceTest {

    @Autowired
    private PdfService pdfService;

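    @Test
    public void testFindImagesInPdf() throws Exception {
        // Placeholder id (an assumption): use the id of a PDF your PdfService can load.
        final Long pdfId = 1L;
        final byte[] resource = pdfService.getPdf(pdfId);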
        int imagePageCount = 0;
        int textPageCount = 0;
        if (resource != null && resource.length > 0) {
            PdfReader reader = new PdfReader(resource);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);

            for (int pageNumber = 1; pageNumber <= reader.getNumberOfPages(); pageNumber++) {

                ImageDetector imageDetector = new ImageDetector();
                parser.processContent(pageNumber, imageDetector);

                if (imageDetector.imageFound) {
                    imagePageCount++;
                }
                if (imageDetector.textFound) {
                    textPageCount++;
                }
            }
            Assert.assertTrue(imagePageCount > 0);
            Assert.assertTrue(textPageCount > 0);
        }
    }
}

Recommended answer

Using iText 5, you can find out whether images are actually shown on a page by parsing the page content into a custom RenderListener implementation. For example:

class ImageDetector implements RenderListener {
    public void beginTextBlock() { }
    public void endTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }

    public void renderImage(ImageRenderInfo renderInfo) {
        imageFound = true;
    }

    boolean imageFound = false;
}

Use it like this:

PdfReader reader = new PdfReader(resource);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int pageNumber = 1; pageNumber <= reader.getNumberOfPages(); pageNumber++)
{
    ImageDetector imageDetector = new ImageDetector();
    parser.processContent(pageNumber, imageDetector);
    if (imageDetector.imageFound) {
        // There is at least one image rendered on page pageNumber
        // Thus, handle it as an image page
    } else {
        // There is no image rendered on page pageNumber
        // Thus, handle it as a no-image page
    }
}
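
Since the stated goal is to find where the first page with text appears, the per-page results of such a loop could be reduced to a single page number. The following is a minimal sketch using only the Java standard library. FirstTextPageFinder, PageFlags and firstTextOnlyPage are hypothetical names, not part of iText, and the sketch assumes a detector that also records a textFound flag, as the ImageDetector in the question does.

```java
import java.util.List;

/** Hypothetical helper, not part of iText: locates the first page that has text but no image. */
final class FirstTextPageFinder {

    /** Per-page result, e.g. copied from a detector's imageFound and textFound flags. */
    static final class PageFlags {
        final boolean imageFound;
        final boolean textFound;

        PageFlags(boolean imageFound, boolean textFound) {
            this.imageFound = imageFound;
            this.textFound = textFound;
        }
    }

    /**
     * Returns the 1-based number of the first page with text and no image,
     * or -1 if there is no such page. Index 0 of the list corresponds to page 1.
     */
    static int firstTextOnlyPage(List<PageFlags> pages) {
        for (int i = 0; i < pages.size(); i++) {
            PageFlags flags = pages.get(i);
            if (flags.textFound && !flags.imageFound) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

Requiring "text and no image" rather than just "text" is a design choice: scanned pages sometimes carry an OCR text layer on top of the image, and those pages should presumably still count as image pages.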

As a possible improvement: in a comment you mention full-page-size images. Thus, in the ImageDetector method renderImage you might want to check the image size before setting imageFound to true. Via the ImageRenderInfo parameter you can retrieve both how large the image is displayed on the page and how large it actually is.
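
One way such a size check might look is sketched below in plain Java, independent of iText. In PDF, an image is drawn into a unit square that the current transformation matrix maps onto the page, so the displayed area follows from the four linear entries of that matrix. In iText 5 these values could be read inside renderImage via renderInfo.getImageCTM() with Matrix.I11, Matrix.I12, Matrix.I21 and Matrix.I22, and the page dimensions via reader.getPageSize(pageNumber). FullPageImageCheck, displayedArea, coversPage and the area-fraction threshold are hypothetical names and parameters chosen for illustration.

```java
/** Hypothetical helper, not part of iText: decides whether a drawn image covers most of a page. */
final class FullPageImageCheck {

    /**
     * Area, in user space units, of the unit square after applying the linear
     * part (a b / c d) of an image transformation matrix. The absolute value of
     * the determinant is used, so rotated or mirrored images are handled too.
     */
    static double displayedArea(double a, double b, double c, double d) {
        return Math.abs(a * d - b * c);
    }

    /**
     * Returns true if the displayed image area is at least minAreaFraction
     * (e.g. 0.9) of the page area. The image position is ignored, so an image
     * placed partly off the page would still be counted with its full area.
     */
    static boolean coversPage(double a, double b, double c, double d,
                              double pageWidth, double pageHeight,
                              double minAreaFraction) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0) {
            return false;
        }
        return displayedArea(a, b, c, d) >= minAreaFraction * pageArea;
    }
}
```

With a check like this, renderImage would set imageFound only when coversPage returns true, so a small logo or stamp on a text page would not cause the page to be classified as an image page.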
