Identify and extract a table from a PDF using Java


Question

I have different types of PDFs that contain a mix of content such as text and tables. A table may appear anywhere on a page (top, middle, or bottom). I want to extract only the table data (the number of columns, the number of rows, and the cell contents) from such a PDF using Java, without passing in the table's location.

What I have done so far:

1. I used the iText Java API to read and extract the content, with the following call:


PdfTextExtractor.getTextFromPage

However, it only returns the data as plain text. I found no clue as to where a table sits in the PDF or how to extract the data from it.

2. I also tried the PDFBox Java API, but it did not solve my problem either.

3. I also followed the Stack Overflow question "PDF table extraction", but it does not give me the expected output. That algorithm expects line positions and similar details as input.

I am not able to work out where the table is located in the PDF.

Can anybody tell me how to solve this problem using the iText or PDFBox APIs, or whether there is any open source API that can help me solve it?

Or can we convert the PDF into HTML, so that we can identify and read the table by its table tags ;)?

Recommended answer

It basically depends on your input documents and on how much effort you are willing to put into this project.

A PDF does not work like an HTML document. In HTML documents you have logical tags like "table" or "paragraph". A PDF document (in the most basic case) contains only the instructions needed to render the document. So instead of getting "table" you might get "draw a line here, and another one a bit further away, and then another one that crosses both, and so on".

Also, according to the PDF specification, these instructions don't even have to appear in logical (reading) order.

If you are lucky, your input PDF might be a tagged PDF. Tagged PDFs contain an internal representation of the underlying structure of the document. A tagged PDF might be able to tell you exactly which objects in the document make up the table.

Now, to get back to an actual answer. If you want a solution that always works, you can implement the iText 7 IEventListener interface. It has a method eventOccurred() that gets called every time the parser has finished dealing with an object (such as a piece of text or a line).

If you then look out for lines, and build some heuristic to determine when a collection of lines constitutes a table, you should be able to detect tables.
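As a rough illustration of such a heuristic, here is a minimal sketch in plain Java. It assumes the line segments have already been collected (for example by a listener like the one described above) and simply counts distinct horizontal and vertical ruling positions. The names TableGridSketch, Segment, detectGrid and the 1.0 tolerance are hypothetical and are not part of iText or PDFBox. Real documents will likely need more checks, such as verifying that the lines actually intersect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: guesses a table grid from ruling lines. Not an iText or PDFBox class. */
class TableGridSketch {

    /** Assumed tolerance (in page units) for treating two coordinates as equal. */
    static final double TOLERANCE = 1.0;

    /** A straight line segment in page coordinates (hypothetical helper type). */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        boolean isHorizontal() { return Math.abs(y1 - y2) <= TOLERANCE; }
        boolean isVertical() { return Math.abs(x1 - x2) <= TOLERANCE; }
    }

    /** Sorts the values and drops those within TOLERANCE of the previously kept one. */
    static List<Double> distinctPositions(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> result = new ArrayList<>();
        for (double v : sorted) {
            if (result.isEmpty() || v - result.get(result.size() - 1) > TOLERANCE) {
                result.add(v);
            }
        }
        return result;
    }

    /**
     * Returns {rows, columns} if the segments look like a ruled grid, or null otherwise.
     * Naive rule: n distinct horizontal rules bound n - 1 rows, and likewise for columns.
     */
    static int[] detectGrid(List<Segment> segments) {
        List<Double> ys = new ArrayList<>();
        List<Double> xs = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isHorizontal() && s.isVertical()) {
                continue; // degenerate dot-like segment, ignore it
            }
            if (s.isHorizontal()) {
                ys.add(s.y1);
            } else if (s.isVertical()) {
                xs.add(s.x1);
            }
        }
        int rows = distinctPositions(ys).size() - 1;
        int cols = distinctPositions(xs).size() - 1;
        return (rows >= 1 && cols >= 1) ? new int[] {rows, cols} : null;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        for (double y : new double[] {100, 80, 60, 40}) {
            segments.add(new Segment(50, y, 250, y)); // four horizontal rules
        }
        for (double x : new double[] {50, 150, 250}) {
            segments.add(new Segment(x, 40, x, 100)); // three vertical rules
        }
        int[] grid = detectGrid(segments);
        System.out.println(grid == null
                ? "no table found"
                : "rows=" + grid[0] + ", cols=" + grid[1]); // prints "rows=3, cols=2"
    }
}
```

Once a grid is found, the text pieces reported by the parser could be assigned to cells by comparing their coordinates with the detected row and column boundaries. Tables drawn without ruling lines would need a different heuristic, for instance one based on text alignment.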

iText also plans to release a pdf2Data add-on, which will basically do the heavy lifting for you.
