Extracting table contents from a collection of PDF files


Problem Description


I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database.

Of course, I know I have to write something to do this. Perl is an option for me - or perhaps Java. I don't really care what language so long as it's free (or cheap with a free trial period to ensure it suits my purposes).

I'm looking at CAM::Parse (using Strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I guess I do have a preference for Perl, but really I want something that works dependably and is reasonably easy to do string manipulation with.

What is a good approach for something like this? I'm at square one, so if Java (or Python, etc.) has better hooks, now is a good time to know about it. General pointers are good; starter code would be strongly preferred.

Solution

  1. From its inception (more than 20 years ago), the PDF format was never intended to host extractable, meaningfully structured data.

  2. Its purpose was to provide a reliable visual representation of the text, images and diagrams in a document: a kind of digital paper (which could also be reliably transferred to real paper via printing). Only later in its development were more features added that are meant to help with extracting the data again (google for Tagged PDF).

  3. For some examples of the problems posed when scraping table data from PDFs, see this article:

  4. Contradicting my point '1.' above, I now say this: for an amazing family of tools that gets better week by week at extracting tabular data from PDFs (unless they are scanned pages), see these links:

So: go look for Tabula. If any tools can do what you want, at this time Tabula is probably amongst the best for the job!
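As a hedged illustration only (none of this appears in the original answer): Tabula's extraction engine can also be driven from Python through the tabula-py wrapper, which fits the questioner's goal of collecting tables from many PDFs into a database. The folder name, database name and table-naming scheme below are placeholders I invented; the sketch assumes tabula-py, pandas and a Java runtime are installed.

    # Sketch: batch-extract tables from a folder of PDFs into SQLite.
    # Assumes tabula-py (pip install tabula-py), pandas and a Java
    # runtime; "pdfs/" and "tables.db" are placeholder names.
    import sqlite3
    from pathlib import Path

    import tabula

    conn = sqlite3.connect("tables.db")

    for pdf in Path("pdfs").glob("*.pdf"):
        try:
            # One pandas DataFrame per table that Tabula detects.
            frames = tabula.read_pdf(str(pdf), pages="all", multiple_tables=True)
        except Exception as err:
            print(f"Skipping {pdf}: {err}")
            continue

        for i, df in enumerate(frames):
            # Store each table under a name derived from the source file
            # and its position within that file.
            df.to_sql(f"{pdf.stem}_table{i}", conn, if_exists="replace", index=False)

    conn.close()

Detection quality varies a lot from PDF to PDF, so in practice you would want to eyeball a sample of the extracted tables before trusting a bulk run.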


Update

I've recently created an asciinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV:

(Click on the image above to see it running. If it runs too fast for you to read all the text, make use of the "Pause" button, the || symbol.)

It is hosted here:
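For readers who prefer to stay in Python rather than call the Java command line directly, tabula-py also exposes a one-shot conversion that does roughly what the screencast shows. This is a sketch under the assumption that tabula-py and a Java runtime are installed; the file names are placeholders, not the files used in the recording.

    # Sketch: convert every table on every page of one PDF into a CSV,
    # roughly the operation the Tabula CLI performs in the screencast.
    # "report.pdf" and "report.csv" are placeholder names.
    import tabula

    tabula.convert_into("report.pdf", "report.csv",
                        output_format="csv", pages="all")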
