PDF table extraction

Question

I have the same data saved as a GIF image file and as a PDF file, and I want to parse it into HTML or XML. The data is the menu for my university's cafeteria, which means a new version of the file has to be parsed each week. In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have made some attempts at parsing the table data out as HTML/XML:

PDF

  • PDFBox || iText (Java)
  • Google Docs import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I got the best result from parsing the PDF file with PDFBox, but (as the menu changes weekly) it is still not reliable enough. The HTML that I receive sometimes includes more and sometimes fewer "paragraphs" (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know: is there another way to do it?

Recommended answer

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
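
Since the goal in the question is HTML or XML rather than CSV, one more step is needed after exporting a table. The following is only a minimal sketch of that step, assuming the table has already been exported as CSV/TSV text (for example from Tabula) and that its first row is a header. The function name csv_to_html_table and the sample menu data are hypothetical and chosen for illustration; only the Python standard library is used.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str, delimiter: str = ",") -> str:
    """Convert CSV/TSV text into a minimal HTML table.

    Assumption: the first non-empty row is a header row.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # Drop rows that are empty or contain only whitespace cells.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return "<table></table>"

    lines = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if index == 0 else "td"
        # Escape cell text so characters like & and < stay valid HTML.
        cells = "".join(
            f"<{tag}>{html.escape(cell.strip())}</{tag}>" for cell in row
        )
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data standing in for an exported menu table.
    sample = "Day,Dish\nMonday,Fish & chips\nTuesday,Pasta\n"
    print(csv_to_html_table(sample))
```

For a TSV export, pass delimiter="\t". Working from rows and columns instead of a varying number of <p> elements may make the weekly parsing less fragile, although that depends on how consistently the table itself is detected in each new file.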
