Determine whether a PDF page contains text or consists purely of an image
Question
How can I determine, using Java, whether a PDF page contains text or consists purely of an image?
I have searched many forums and websites, but I have not found an answer yet.
Is it possible to extract the text from a PDF in order to tell whether a page is stored as an image or as text?
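// iText 5 classes: com.itextpdf.text.pdf.PdfReader, com.itextpdf.text.pdf.parser.PdfTextExtractor
PdfReader reader = new PdfReader(INPUTFILE);
PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Here I want to test the structure of the page, if that is possible
    out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
out.close(); // flush the extracted text to OUTPUTFILE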
Recommended answer
There is no foolproof way to do what you want.
Text can appear in different ways inside a PDF file. For instance, one can draw all the glyphs as shapes using graphics state operators instead of using text state, in which case a text extractor finds no text on the page. (I'm sorry if this sounds cryptic, but I can assure you it's proper PDF language.)
If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is acceptable to you, then you already have a good first workaround.
In your code, you loop over all the pages and ask iText whether there is any text on each page. That is already a good indication.
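As a minimal sketch of that first check: the string returned by PdfTextExtractor.getTextFromPage(reader, i) can be tested for visible characters, since a page without text typically yields an empty or whitespace-only string. The class name ExtractedTextCheck, its methods, and the minVisibleChars threshold are hypothetical names introduced here for illustration; they are not part of iText.

```java
class ExtractedTextCheck {

    /** Counts characters that are neither whitespace nor control characters. */
    static int countVisibleChars(String extracted) {
        if (extracted == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (!Character.isWhitespace(c) && !Character.isISOControl(c)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Heuristic only: treats the page as containing text when the extracted
     * string has at least minVisibleChars visible characters.
     * The threshold is an assumption; tune it for your documents.
     */
    static boolean looksLikeText(String extracted, int minVisibleChars) {
        return countVisibleChars(extracted) >= minVisibleChars;
    }

    public static void main(String[] args) {
        // In real code the argument would be the result of
        // PdfTextExtractor.getTextFromPage(reader, i).
        System.out.println(looksLikeText("Hello World", 1));
        System.out.println(looksLikeText(" \n\t ", 1));
    }
}
```

A threshold above 1 can help ignore pages that carry only a stray character or a page number, at the cost of misclassifying pages with very little genuine text.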
Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. MyTextRenderListener is an example of a custom implementation; it is used in the ParsingHelloWorld example.
There is also a renderImage() method (see for instance MyImageListener). If this method is triggered, you can be sure that the page also contains an image, and you can use the ImageRenderInfo object to obtain the position, width, and height of the image (that is, if you know how to interpret the Matrix returned by the getImageCTM() method).
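To illustrate how such a matrix can be read: in PDF, an image is painted into a unit square that the current transformation matrix [a b c d e f] maps onto the page. The sketch below works on plain numbers so that it stays independent of iText; in iText 5 the six values would presumably be read with Matrix.get(Matrix.I11), I12, I21, I22, I31 and I32. The ImagePlacement class and its methods are hypothetical names, and the code assumes the image is only scaled, rotated, and translated (no skew).

```java
class ImagePlacement {
    final double x;      // where the image-space origin lands on the page
    final double y;
    final double width;  // length of the transformed horizontal unit vector
    final double height; // length of the transformed vertical unit vector
    final double area;   // area of the transformed unit square

    ImagePlacement(double x, double y, double width, double height, double area) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.area = area;
    }

    /** Builds a placement from the six CTM entries [a b c d e f]. */
    static ImagePlacement fromCtm(double a, double b, double c, double d,
                                  double e, double f) {
        double width = Math.hypot(a, b);
        double height = Math.hypot(c, d);
        double area = Math.abs(a * d - b * c); // determinant of the linear part
        return new ImagePlacement(e, f, width, height, area);
    }

    /**
     * Fraction of the page area covered by the image. The value can exceed
     * 1.0 when the image extends beyond the page; clipping is not modelled.
     */
    double coverage(double pageWidth, double pageHeight) {
        return area / (pageWidth * pageHeight);
    }

    public static void main(String[] args) {
        // A 200 x 100 point image whose origin is placed at (50, 60).
        ImagePlacement p = fromCtm(200, 0, 0, 100, 50, 60);
        System.out.println(p.width + " x " + p.height + " at (" + p.x + ", " + p.y + ")");
    }
}
```

A single image whose coverage is close to 1.0, on a page where no text was extracted, is a strong hint that the page is a scan.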
Using all these elements, you can already go a long way towards achieving what you need, but be aware that there will always be exotic PDFs that escape all your checks.
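Putting the two signals together could look roughly like this. The sketch assumes you have already counted, per page, the visible characters reported through renderText() and the number of renderImage() calls; PageClassifier and PageKind are hypothetical names, not iText types, and the result is a best guess rather than a guarantee.

```java
class PageClassifier {

    enum PageKind { EMPTY, TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGES }

    /**
     * Classifies a page from two per-page counters.
     *
     * visibleChars: number of non-whitespace characters extracted from the page
     * imageCount:   number of images reported for the page
     */
    static PageKind classify(int visibleChars, int imageCount) {
        boolean hasText = visibleChars > 0;
        boolean hasImages = imageCount > 0;
        if (hasText && hasImages) {
            // Also the result for a scan with a hidden OCR text layer.
            return PageKind.TEXT_AND_IMAGES;
        }
        if (hasText) {
            return PageKind.TEXT_ONLY;
        }
        if (hasImages) {
            return PageKind.IMAGE_ONLY;
        }
        // Also the result when all glyphs are drawn as shapes:
        // neither text nor an image is reported.
        return PageKind.EMPTY;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, 1));
        System.out.println(classify(120, 0));
    }
}
```

The two comments inside classify() mark the cases where this heuristic is known to give a misleading answer.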