How to detect if a PDF page has an image in it
Problem description
I am working with PDF/A-style PDF documents that contain a mixture of scanned, full-page-size images, followed by a page or two of text that was added with a ColumnText object.
Using Java, how do I detect which pages have an image?
The intent of detecting which pages have images or text is to determine where the first page with text appears. I need to either edit that text or replace the text page(s) with updated ones. The pages with images would remain untouched.
I'm using iText 5 and don't currently have the option of upgrading to iText 7.
Here's what I implemented, based on the solution provided by @mkl:
ImageDetector.java
package org.test.pdf;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ImageDetector implements RenderListener {
    // Flags set while the content of a single page is parsed.
    boolean textFound = false;
    boolean imageFound = false;

    public void beginTextBlock() { }

    public void endTextBlock() { }

    public void renderText(TextRenderInfo renderInfo) {
        textFound = true;
    }

    public void renderImage(ImageRenderInfo renderInfo) {
        imageFound = true;
    }
}
PdfDocumentServiceTest.java
package org.test.pdf;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.test.PdfService;
import java.io.IOException;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ActiveProfiles;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.transaction.annotation.Transactional;

@ActiveProfiles({"local", "testing"})
@DirtiesContext
@Transactional
@RunWith(SpringRunner.class)
@SpringBootTest
public class PdfDocumentServiceTest {

    @Autowired
    private PdfService pdfService;

    // JUnit 4 test methods cannot take parameters, so the id is a local value.
    @Test
    public void testFindImagesInPdf() throws IOException {
        // Placeholder id (assumption): use the id of a PDF known to your PdfService.
        final Long pdfId = 1L;
        final byte[] resource = pdfService.getPdf(pdfId);
        int imagePageCount = 0;
        int textPageCount = 0;
        if (resource != null && resource.length > 0) {
            PdfReader reader = new PdfReader(resource);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int pageNumber = 1; pageNumber <= reader.getNumberOfPages(); pageNumber++) {
                // Use a fresh detector per page so the flags describe only that page.
                ImageDetector imageDetector = new ImageDetector();
                parser.processContent(pageNumber, imageDetector);
                if (imageDetector.imageFound) {
                    imagePageCount++;
                }
                if (imageDetector.textFound) {
                    textPageCount++;
                }
            }
            reader.close();
            Assert.assertTrue(imagePageCount > 0);
            Assert.assertTrue(textPageCount > 0);
        }
    }
}
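Since the stated goal is to find where the first text page appears, the per-page flags can be collected and scanned afterwards. The following is only a minimal sketch: the class name FirstTextPage and its find method are hypothetical, and it assumes the flags were gathered from ImageDetector into two arrays, one entry per page. It treats a page as a text page only if text was found and no image was found, on the assumption that scanned pages may also carry text (for example an OCR layer).

```java
public final class FirstTextPage {

    private FirstTextPage() { }

    /**
     * Returns the 1-based number of the first page that has text but no image,
     * or -1 if there is no such page. Index 0 of each array describes page 1.
     */
    public static int find(boolean[] imageFound, boolean[] textFound) {
        int pages = Math.min(imageFound.length, textFound.length);
        for (int i = 0; i < pages; i++) {
            if (textFound[i] && !imageFound[i]) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pages 1 and 2 are scans (page 2 also carries text), pages 3 and 4 are text only.
        boolean[] image = { true, true, false, false };
        boolean[] text = { false, true, true, true };
        System.out.println(find(image, text)); // prints 3
    }
}
```

If your scanned pages never contain text, checking textFound alone would be enough.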
Recommended answer
Using iText 5, you can find out whether images are actually shown on a page by parsing the page content into a custom RenderListener implementation, e.g.:
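import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

class ImageDetector implements RenderListener {
    // Set to true as soon as any image is drawn while a page is parsed.
    boolean imageFound = false;

    public void beginTextBlock() { }

    public void endTextBlock() { }

    public void renderText(TextRenderInfo renderInfo) { }

    public void renderImage(ImageRenderInfo renderInfo) {
        imageFound = true;
    }
}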
Use it like this:
PdfReader reader = new PdfReader(resource);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int pageNumber = 1; pageNumber <= reader.getNumberOfPages(); pageNumber++) {
    // A fresh detector per page keeps the flag specific to that page.
    ImageDetector imageDetector = new ImageDetector();
    parser.processContent(pageNumber, imageDetector);
    if (imageDetector.imageFound) {
        // There is at least one image rendered on page pageNumber.
        // Thus, handle it as an image page.
    } else {
        // There is no image rendered on page pageNumber.
        // Thus, handle it as a no-image page.
    }
}
As a possible improvement: in a comment you mention full-page-size images. Thus, in the ImageDetector method renderImage you might want to check the image size before setting imageFound to true. Via the ImageRenderInfo parameter you can retrieve information both on how large the image is displayed on the page and on how large it actually is.
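One way to do such a size check is to compare the area the image occupies on the page with the area of the page itself. The sketch below is an illustration rather than part of the original answer: the class ImageCoverage, its methods, and the 0.9 threshold are hypothetical. It is kept free of iText types so it can be run on its own. The four matrix values a, b, c, d are the scaling and rotation entries of the image's transformation matrix, which maps the unit square onto the image's position on the page.

```java
/** Pure-Java helper (hypothetical) that decides whether an image covers most of a page. */
public final class ImageCoverage {

    private ImageCoverage() { }

    /**
     * Area of the image as drawn on the page, in user space units squared.
     * An image fills the unit square transformed by [a b 0; c d 0; e f 1],
     * so its area is the absolute determinant |a*d - b*c|.
     */
    public static double displayedArea(double a, double b, double c, double d) {
        return Math.abs(a * d - b * c);
    }

    /** True if the displayed image area is at least minFraction of the page area. */
    public static boolean coversPage(double a, double b, double c, double d,
                                     double pageWidth, double pageHeight,
                                     double minFraction) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0) {
            return false;
        }
        return displayedArea(a, b, c, d) / pageArea >= minFraction;
    }

    public static void main(String[] args) {
        // A4 page (595 x 842 points) with an image scaled to fill it.
        System.out.println(coversPage(595, 0, 0, 842, 595, 842, 0.9)); // prints true
        // A small 100 x 50 logo on the same page.
        System.out.println(coversPage(100, 0, 0, 50, 595, 842, 0.9)); // prints false
    }
}
```

In renderImage, the matrix values should be available from renderInfo.getImageCTM() via Matrix.get with the constants Matrix.I11, Matrix.I12, Matrix.I21 and Matrix.I22, and the page dimensions from reader.getPageSize(pageNumber); the detector would then need to be given the page size, for example through its constructor. Note that this check ignores clipping and images that extend beyond the page, so treat the threshold as a heuristic.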