Extract a table from a PDF


Question

I am trying to extract a table from a PDF document.

I tried the route PDF -> HTML -> extract table. Converting the PDF mentioned above to HTML produces garbage, maybe because of the fonts; the document is not in English.

Extracting by fixed x and y coordinates is not an option, because the solution needs to work for future PDFs from the URL mentioned above. Those will contain the table, but not always in the same position.

Please help.

Thanks.

Recommended answer

A PDF does not contain explicit table data. It only contains lines and character glyphs, which we as readers tend to interpret as tables. Your task therefore amounts to putting human table recognition into code, which is a considerable undertaking.
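To illustrate what "putting table recognition into code" can look like without relying on fixed positions, here is a minimal sketch. It assumes some extraction step has already produced words with page coordinates; the `Word` type, the tolerances, and the sample data are all hypothetical and not taken from the document in question. Words are clustered into rows by similar baseline and into columns by similar left edge, so the table may sit anywhere on the page.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A positioned word (hypothetical type; PDF y grows upward)."""
    text: str
    x: float  # left edge
    y: float  # baseline


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[Word]]:
    """Cluster words whose baselines lie within y_tolerance of the row's first word."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (-w.y, w.x)):  # top of page first
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w.x) for row in rows]


def column_starts(rows: List[List[Word]], x_tolerance: float = 5.0) -> List[float]:
    """Derive column positions from clusters of left edges across all rows."""
    starts: List[float] = []
    for x in sorted(w.x for row in rows for w in row):
        if not starts or x - starts[-1] > x_tolerance:
            starts.append(x)
    return starts


def to_table(words: List[Word], y_tolerance: float = 2.0,
             x_tolerance: float = 5.0) -> List[List[str]]:
    """Build a row-by-column grid of cell strings; missing cells stay empty."""
    rows = group_into_rows(words, y_tolerance)
    starts = column_starts(rows, x_tolerance)
    table = []
    for row in rows:
        cells = [""] * len(starts)
        for w in row:
            # Assign each word to the nearest column start.
            idx = min(range(len(starts)), key=lambda i: abs(starts[i] - w.x))
            cells[idx] = (cells[idx] + " " + w.text).strip()
        table.append(cells)
    return table


# Made-up sample data with slightly jittered coordinates.
sample_words = [
    Word("Name", 50, 700.2), Word("Qty", 200, 700), Word("Price", 300, 700.1),
    Word("Bolt", 50.5, 680), Word("12", 201, 680.3), Word("0.40", 300, 679.8),
    Word("Nut", 50, 660), Word("0.10", 301, 660),
]
for cells in to_table(sample_words):
    print(cells)
```

A heuristic like this only works for simple, left-aligned layouts; right-aligned numbers, merged cells, or wrapped cell text need more elaborate rules, which is why the task is hard in general.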

Generally speaking, if you are sure enough that future PDFs will be generated by the same software in a very similar manner, it may be worth the time to investigate the file for easy-to-follow hints that identify the contents of individual fields.

Your specific document, though, has an additional shortcoming: it does not contain the information required for direct text extraction! Try copying and pasting from Adobe Reader and you will get (at least I do) semi-random characters from the WinAnsi range.

The reason is that all fonts in the document claim to use WinAnsiEncoding, even though the characters referenced this way are definitely not from the WinAnsi character set.

Thus reliable text extraction from your document is impossible without OCR after all!

(Trying copy and paste from Adobe Reader is generally a good first test of whether text extraction is feasible at all. The Reader's text extraction methods have been developed over many years and have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)
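The manual copy-and-paste test can be approximated in code once you have extracted text in hand. The following is a minimal sketch, assuming you know which script the document should be written in; the question does not say, so Greek is used here purely as a hypothetical example, and the function names and the 0.8 threshold are illustrative choices. It measures how many letters in the extracted text belong to the expected script: a value near zero suggests the fonts' declared encoding is wrong and OCR is needed.

```python
import unicodedata


def script_fraction(text: str, script_prefix: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def looks_extractable(text: str, script_prefix: str, threshold: float = 0.8) -> bool:
    """Heuristic: most letters should come from the expected script."""
    return script_fraction(text, script_prefix) >= threshold


# Hypothetical samples: proper Greek text versus the kind of WinAnsi-range
# garbage produced when glyph codes are interpreted with the wrong encoding.
good = "Πίνακας τιμών"
garbled = "Ðßíáêáò ôéìþí"

print(looks_extractable(good, "GREEK"))
print(looks_extractable(garbled, "GREEK"))
```

The check is deliberately crude: it cannot tell correct text from garbage when both use the same script, so it complements the Adobe Reader test rather than replacing it.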
