How to read or extract graphical components such as squares, rectangles, and lines from a PDF document using Java?

Question

I am trying to extract all the data (such as squares, rectangles, lines, etc.) from a PDF document that was generated using iText. However, I am not able to extract any content other than text and images. I want to extract the graphical components mentioned above.

Recommended answer

There seem to be three options for this (at least, those are the ones I could find). I do not know exactly what you have, so I will list all three, in increasing order of difficulty.

First option: you could do something like this (taken from here):

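// PDFBox 1.x API (getAllPages and PDXObjectImage were removed in PDFBox 2.x).
// Note: this extracts embedded images only, not vector shapes such as lines or rectangles.
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

PDDocument document = PDDocument.load(inFile); // inFile: the input PDF file
try {
    List<?> pages = document.getDocumentCatalog().getAllPages();
    for (Object p : pages) {
        PDPage page = (PDPage) p;
        PDResources resources = page.getResources();
        Map<?, ?> pageImages = resources.getImages();
        if (pageImages != null) {
            // Write every image XObject referenced by this page.
            for (Object value : pageImages.values()) {
                PDXObjectImage image = (PDXObjectImage) value;
                image.write2OutputStream(/* some output stream */);
            }
        }
    }
} finally {
    document.close(); // release the file handle
}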

Second option: convert your PDF document to HTML, using something along the lines of what is shown here, and then use JSoup to process the HTML and iterate over the img tags, which is how I am assuming the images will be rendered.

Alternatively, as a third option, you could take a look at the Hough transform:

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure.
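To make the voting procedure more concrete, here is a minimal sketch of a Hough line transform in plain Java, without OpenCV. It assumes the PDF page has already been rendered and thresholded into a binary edge image (that step is not shown), and it only reports the single strongest straight line. The names HoughLineSketch, Line, and strongestLine are hypothetical and do not come from any library.

```java
/** Minimal Hough line transform over a binary edge image (illustrative sketch). */
final class HoughLineSketch {

    /** A line in normal form: x*cos(theta) + y*sin(theta) = rho. */
    static final class Line {
        final int thetaDegrees; // angle of the line's normal, 0..179
        final int rho;          // signed distance from the origin, in pixels
        final int votes;        // number of edge pixels that voted for this line

        Line(int thetaDegrees, int rho, int votes) {
            this.thetaDegrees = thetaDegrees;
            this.rho = rho;
            this.votes = votes;
        }
    }

    /** Returns the (theta, rho) cell that collected the most votes. */
    static Line strongestLine(boolean[][] edges) {
        int height = edges.length;
        int width = edges[0].length;
        // |rho| can never exceed the image diagonal.
        int maxRho = (int) Math.ceil(Math.hypot(width, height));
        int[][] accumulator = new int[180][2 * maxRho + 1];

        // Voting: every edge pixel votes for all lines that could pass through it.
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!edges[y][x]) {
                    continue;
                }
                for (int t = 0; t < 180; t++) {
                    double theta = Math.toRadians(t);
                    int rho = (int) Math.round(x * Math.cos(theta) + y * Math.sin(theta));
                    accumulator[t][rho + maxRho]++;
                }
            }
        }

        // Peak search: the fullest accumulator cell is the best-supported line.
        Line best = new Line(0, 0, 0);
        for (int t = 0; t < 180; t++) {
            for (int r = 0; r < accumulator[t].length; r++) {
                if (accumulator[t][r] > best.votes) {
                    best = new Line(t, r - maxRho, accumulator[t][r]);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed input: a 40x40 edge image containing one horizontal segment at y = 20.
        boolean[][] edges = new boolean[40][40];
        for (int x = 5; x < 35; x++) {
            edges[20][x] = true;
        }
        Line line = strongestLine(edges);
        System.out.println("theta=" + line.thetaDegrees + " rho=" + line.rho
                + " votes=" + line.votes);
    }
}
```

A real page would need more than this: several peaks rather than one, a vote threshold, and some way to recover segment end points. Rectangles and squares would then have to be assembled from the detected lines.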

An imaging library such as OpenCV should provide this functionality out of the box; OpenCV-Java is a Java wrapper for that library.

This example should point you in the right direction.
