PDF table extraction

Views: 87 · Published: 2020/5/25 4:06:28 · Tags: pdf, pdfbox, extraction

Question

I have the same data saved both as a GIF image file and as a PDF file, and I want to parse it into HTML or XML. The data is the menu for my university's cafeteria, so a new version of the file has to be parsed every week. In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have made a few attempts at parsing the table data out as HTML/XML:

PDF

PDFBox || iText (Java)
Google Docs import
PDF2HTML || PDF2Table

GIF

Tesseract-OCR

I got the best results from parsing the PDF file with PDFBox, but because the menu changes weekly, it is still not reliable enough. The HTML that I receive sometimes contains more and sometimes fewer "paragraphs" (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know: is there another way to do it?
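
For context on what "another way" might look like: instead of relying on how many <p> elements the HTML export happens to contain, one option is to rebuild the table from the position of each piece of text on the page. The following is only a minimal sketch of that idea, not a tested solution for this menu. The `Fragment` class, the `groupIntoRows` helper, the tolerance value, and the sample coordinates are all hypothetical and invented for illustration. In practice the positions would have to come from the PDF library (PDFBox's `TextPosition` objects carry such coordinates), and the sketch assumes y grows downward on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableRows {

    /** Hypothetical holder for one piece of text and its position on the page. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into table rows. A fragment joins the current row when
     * its y differs from the row's first fragment by at most yTolerance.
     * Rows are returned top to bottom (ascending y), cells left to right.
     */
    static List<List<String>> groupIntoRows(List<Fragment> fragments, double yTolerance) {
        // Work on a copy so the caller's list is not reordered.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        // Within each row, order the cells by their horizontal position.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample data: two rows whose baselines differ slightly.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(200, 100.4, "Pasta"));
        page.add(new Fragment(50, 100.0, "Monday"));
        page.add(new Fragment(50, 120.0, "Tuesday"));
        page.add(new Fragment(200, 119.8, "Soup"));

        System.out.println(groupIntoRows(page, 2.0));
        // prints [[Monday, Pasta], [Tuesday, Soup]]
    }
}
```

Grouping by coordinates does not depend on the paragraph markup at all, which is the part that varies from week to week. The tolerance would need tuning against real files, and cells that wrap onto several lines would need extra handling that this sketch does not attempt.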
