Extracting text from PDF with Poppler (C++)

Question

I'm trying to find my way through Poppler and its (lack of) documentation.

What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here.

So I found the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there a simple function that would output the PDF text in order (maybe line by line)?

Recommended answer

You should be able to set the selection rectangle to the pageSize/MediaBox of the page and get all the text.

I say "should" because, before you start wondering about surprising output from poppler_page_get_text, you should be aware of how text gets laid out on a page. All graphics are placed on a page by a program expressed in postfix notation. To render the page, this program is executed on a blank page.
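To make the postfix idea concrete, here is a minimal sketch (not Poppler code) that groups each operator with the operands that precede it. The names Operation and parse_postfix are hypothetical, and the sketch assumes whitespace-separated tokens with numeric operands only, which real content streams do not guarantee.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// One step of the page program: an operator and the operands that preceded it.
struct Operation {
    std::string op;
    std::vector<double> operands;
};

// Split a postfix program into operations.
// Assumption: tokens are separated by whitespace and every operand is a number.
std::vector<Operation> parse_postfix(const std::string& program) {
    std::vector<Operation> ops;
    std::vector<double> pending;  // operands waiting for their operator
    std::istringstream in(program);
    std::string tok;
    while (in >> tok) {
        char* end = nullptr;
        const double value = std::strtod(tok.c_str(), &end);
        const bool is_number = (end != tok.c_str() && *end == '\0');
        if (is_number) {
            pending.push_back(value);
        } else {
            // An operator consumes everything that was pushed before it.
            ops.push_back({tok, pending});
            pending.clear();
        }
    }
    return ops;
}
```

For a program such as "0.5 w 10 10 m 100 10 l S" (set line width, move to, line to, stroke), the operands come first and the operator last, which is why an interpreter has to buffer operands until it sees the operator.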

Operations in the program can include changing colors, position, or the current transformation matrix, and drawing lines, Bézier curves and so on. Text is laid out by a series of text operators that are always bracketed by BT (begin text) and ET (end text). How or where text is placed on a page is at the sole discretion of the software that generates the PDF. For example, for print drivers, the code responds to GDI calls for DrawString and translates them into text drawing operations.
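As an illustration of the BT/ET bracketing, the following sketch scans a content stream and collects the strings shown by the Tj operator inside text blocks. It is a toy, not what Poppler does internally: the function name extract_shown_strings is hypothetical, and it assumes an uncompressed stream, literal (parenthesized) strings without nested parentheses, and no TJ arrays, hex strings, or font re-encoding.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Collect the literal strings shown with Tj between BT and ET.
// Assumptions: uncompressed stream, no nested parentheses, Tj only.
std::vector<std::string> extract_shown_strings(const std::string& stream) {
    std::vector<std::string> shown;
    std::string pending;        // most recent string operand
    bool have_pending = false;  // true if the previous token was a string
    bool in_text = false;       // true between BT and ET
    std::size_t i = 0;
    while (i < stream.size()) {
        if (std::isspace(static_cast<unsigned char>(stream[i]))) {
            ++i;
            continue;
        }
        if (stream[i] == '(') {
            // String operand: read up to the closing parenthesis.
            std::string s;
            ++i;
            while (i < stream.size() && stream[i] != ')') {
                // Keep the character after a backslash literally; this is only
                // correct for \( \) and \\, not for escapes such as \n.
                if (stream[i] == '\\' && i + 1 < stream.size()) ++i;
                s += stream[i++];
            }
            ++i;  // skip ')'
            pending = s;
            have_pending = true;
            continue;
        }
        // Any other token runs until whitespace or the start of a string.
        const std::size_t start = i;
        while (i < stream.size() &&
               !std::isspace(static_cast<unsigned char>(stream[i])) &&
               stream[i] != '(') {
            ++i;
        }
        const std::string tok = stream.substr(start, i - start);
        if (tok == "BT") {
            in_text = true;
        } else if (tok == "ET") {
            in_text = false;
        } else if (tok == "Tj" && in_text && have_pending) {
            shown.push_back(pending);
        }
        have_pending = false;  // a string operand only applies to the next token
    }
    return shown;
}
```

Note that the strings come out in stream order, which is whatever order the generating software chose, not necessarily reading order.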

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures, where multiple characters are replaced by a single glyph; the most common ones are ae, oe, fi, fl, and ffl.
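Two of these problems, out-of-order fragments and ligatures, can be sketched as post-processing steps. This is an illustrative sketch, not part of Poppler: Fragment, Ligature, expand_ligatures and in_reading_order are hypothetical names, fragments on the same line are assumed to share exactly the same baseline y, and the text is assumed to be UTF-8 with ligatures encoded as their Unicode code points. Expanding æ and œ is lossy in languages where they are real letters.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A piece of text with the position of its origin in PDF user space.
struct Fragment {
    double x;
    double y;
    std::string text;
};

// UTF-8 bytes of a ligature code point and the letters it stands for.
struct Ligature {
    std::string glyph;
    std::string letters;
};

// Replace common ligature code points with their component letters.
std::string expand_ligatures(std::string s) {
    static const std::vector<Ligature> table = {
        {"\xEF\xAC\x81", "fi"},   // U+FB01
        {"\xEF\xAC\x82", "fl"},   // U+FB02
        {"\xEF\xAC\x84", "ffl"},  // U+FB04
        {"\xC3\xA6", "ae"},       // U+00E6
        {"\xC5\x93", "oe"},       // U+0153
    };
    for (const Ligature& lig : table) {
        std::size_t pos = 0;
        while ((pos = s.find(lig.glyph, pos)) != std::string::npos) {
            s.replace(pos, lig.glyph.size(), lig.letters);
            pos += lig.letters.size();
        }
    }
    return s;
}

// Sort fragments top to bottom, then left to right, and join them.
// Assumption: fragments on one line have identical y values; a real
// extractor would need a tolerance and would handle columns and rotation.
std::string in_reading_order(std::vector<Fragment> frags) {
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) {
                  if (a.y != b.y) return a.y > b.y;  // PDF y grows upward
                  return a.x < b.x;
              });
    std::string out;
    for (std::size_t i = 0; i < frags.size(); ++i) {
        if (i > 0) out += (frags[i].y == frags[i - 1].y) ? ' ' : '\n';
        out += expand_ligatures(frags[i].text);
    }
    return out;
}
```

Re-encoded fonts are a different matter: undoing them needs the font's encoding or ToUnicode information, which cannot be recovered from positions alone.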

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0 - it's a real challenge to get right.
