How to extract data from a PDF file while keeping track of its structure?

Question

My objective is to extract the text and images from a PDF file while parsing its structure. The structure does not need to be parsed exhaustively; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

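  • Convert the PDF to text. This does not work for me because I lose the images and the structure of the document.
  • Convert the PDF to HTML. I found a few tools that can help with that, and the best one so far is pdftohtml. The tool is really good at presentation, but I have not been able to parse the HTML successfully.
  • Convert the PDF to XML. Same as above.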

Does anyone have any suggestions on how to tackle this problem?

Recommended answer

There is essentially no easy cut-and-paste solution, because PDF isn't really very interested in structure. Many other answers on this site go into much more detail, but this one should give you the main points:

If identifying the text structure in a PDF document is so difficult, how do PDF readers do it?

If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all text on the pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, and so on).
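As a rough illustration of the font-size heuristic, here is a minimal Python sketch. It assumes some PDF library has already produced a list of text fragments with their font sizes; the `Fragment` class, the function names, and the 1.2 size ratio are hypothetical choices for this example, not part of any particular library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment, as a PDF library might report it."""
    text: str
    font_name: str
    font_size: float


def body_font_size(fragments):
    """Return the font size that covers the most characters (assumed to be body text)."""
    weights = Counter()
    for frag in fragments:
        weights[round(frag.font_size, 1)] += len(frag.text)
    return weights.most_common(1)[0][0]


def find_headings(fragments, ratio=1.2):
    """Return fragments at least `ratio` times larger than the body font size."""
    body = body_font_size(fragments)
    return [frag for frag in fragments if frag.font_size >= body * ratio]


# Made-up sample data for illustration only.
sample = [
    Fragment("Introduction", "Helvetica-Bold", 18.0),
    Fragment("This is body text that goes on for a while.", "Helvetica", 11.0),
    Fragment("More body text here.", "Helvetica", 11.0),
]
headings = [frag.text for frag in find_headings(sample)]
```

Size alone will miss headings set in the body size but in a different font or weight, so a real document would probably need the font name as a second signal.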

On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the closeness of certain letters, words and lines. PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".
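The positional grouping described here could be sketched as follows. This is only an illustration under stated assumptions: the fragments are hypothetical `(text, x, y)` records already extracted by some library, `y` is measured downward from the top of the page (PDF's native coordinates start at the bottom left, so a real extractor may need a flip), and the tolerance and gap thresholds are arbitrary values that would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text fragment."""
    text: str
    x: float  # left edge
    y: float  # baseline, measured downward from the top of the page (assumption)


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments whose baselines lie within y_tolerance of each other."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order the fragments left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_into_paragraphs(lines, max_gap=16.0):
    """Start a new paragraph whenever the vertical gap between lines exceeds max_gap."""
    paragraphs = []
    prev_y = None
    for line in lines:
        y = line[0].y
        text = " ".join(frag.text for frag in line)
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev_y = y
    return [" ".join(parts) for parts in paragraphs]


# Made-up sample data: two lines close together, then a larger gap.
sample = [
    Fragment("world", 60, 100.0),
    Fragment("Hello", 10, 100.5),
    Fragment("again", 10, 112.0),
    Fragment("New", 10, 140.0),
    Fragment("paragraph", 40, 140.0),
]
paragraphs = group_into_paragraphs(group_into_lines(sample))
```

Fixed thresholds are the weak point of this approach; deriving them from the font size of each line would likely be more robust.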

To complicate things even more, the order in which text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be the proper reading order).
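One way to approximate reading order is to ignore the drawing order and re-sort fragments by position. The sketch below is a simplified illustration, not a general solution: it assumes hypothetical `(text, x, y)` tuples with `y` measured downward from the top of the page, and equal-width columns whose count is known in advance.

```python
def reading_order(fragments, page_width, columns=1):
    """Sort (text, x, y) tuples by column, then top to bottom, then left to right."""
    column_width = page_width / columns

    def sort_key(frag):
        _text, x, y = frag
        # Clamp so that a fragment at the right page edge stays in the last column.
        column = min(int(x // column_width), columns - 1)
        return (column, y, x)

    return [text for text, _x, _y in sorted(fragments, key=sort_key)]


# Made-up fragments listed in the order they were drawn, not the order they are read.
drawn = [
    ("right column top", 320, 50),
    ("left column second", 40, 70),
    ("left column top", 40, 50),
    ("right column second", 320, 70),
]
ordered = reading_order(drawn, page_width=600, columns=2)
```

Pages with sidebars, captions, or text that spans several columns would break this simple ordering, which is part of why the problem is hard in general.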
