PDF parsing in C++ (PoDoFo)


Problem description


Hi, I'm trying to extract text from some PDFs and would like to use PoDoFo. I have searched for examples of how to parse a PDF with PoDoFo, but all I can find are examples of how to create and write a PDF file, which is not what I need.

If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or a suggestion for a different library, please let me know. I know there is pdftotext on Linux, but I cannot use it, and I would much rather do everything internally than rely on outside programs being installed.

Solution

PoDoFo does not provide a means to easily extract text from a document, but it is not hard to do.

Load a document into a PdfMemDocument:

PoDoFo::PdfMemDocument pdf("mydoc.pdf");

Iterate over each page:

for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PoDoFo::PdfPage* page = pdf.GetPage(pn);

Iterate over all the PDF commands on that page:

    PoDoFo::PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PoDoFo::PdfVariant var;
    PoDoFo::EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        if (type == PoDoFo::ePdfContentsType_Keyword) {
            // process type, token & var
        }
    }
}

The "process type, token & var" is where it gets a little more complex. You are given raw PDF commands to process. Luckily, if you're not actually rendering the page and all you want is the text, you can ignore most of them. The commands you need to process are:

BT, ET, Td, TD, Ts, T*, Tm, Tf, ", ', Tj and TJ

The BT and ET commands mark the beginning and end of a text stream, so you want to ignore anything that's not between a BT/ET pair.

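The PDF language is RPN based. A command stream consists of values which are pushed onto a stack and commands which pop values off the stack and process them. The text-showing commands Tj, ' and TJ take a single parameter, which will be in a var object; some of the others take more (Td and TD take two, Tm six, Tf two and " three). The tokenizer hands values over one at a time, with type ePdfContentsType_Variant, before the keyword that uses them, so keep the most recent values until the keyword arrives.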
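The stack behaviour described above can be sketched without PoDoFo at all. The `Token` struct and `collectShownText` function below are hypothetical stand-ins, not PoDoFo types: a token is either a value or a keyword, values are collected until a keyword arrives, and only Tj inside a BT/ET pair produces text. Treat it as a minimal sketch of the idea rather than a complete extractor.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for what a content-stream tokenizer yields: either a value
// (an operand) or a keyword (an operator). Hypothetical type, not PoDoFo.
struct Token {
    bool isKeyword;    // true for operators such as BT, Tj, ET
    std::string text;  // keyword name, or the operand's string value
};

// Collect the strings shown by Tj between BT and ET.
std::vector<std::string> collectShownText(const std::vector<Token>& tokens) {
    std::vector<std::string> shown;
    std::vector<std::string> operands;  // values waiting for their operator
    bool inTextObject = false;

    for (const Token& t : tokens) {
        if (!t.isKeyword) {
            operands.push_back(t.text);  // push the value, keep reading
            continue;
        }
        if (t.text == "BT") {
            inTextObject = true;
        } else if (t.text == "ET") {
            inTextObject = false;
        } else if (t.text == "Tj" && inTextObject && !operands.empty()) {
            shown.push_back(operands.back());  // Tj uses the last value
        }
        operands.clear();  // every operator consumes its operands
    }
    return shown;
}
```

With real PoDoFo tokens the same shape applies: push each value you are handed, and act on the collected values when a keyword arrives.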

The ", ', Tj and TJ commands are the only ones which actually show text. ", ' and Tj each show a single string. Use var.IsString() and var.GetString() to process it.

TJ takes an array whose elements are strings, possibly interleaved with numeric position adjustments. You can extract each string with:

if (var.IsArray()) {
    PoDoFo::PdfArray& a = var.GetArray();
    for (size_t i = 0; i < a.GetSize(); ++i) {
        if (a[i].IsString()) {
            // do something with a[i].GetString()
        }
    }
}
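A TJ array can also contain numbers between the strings. According to the PDF specification they are position adjustments in thousandths of a text-space unit, and a negative value moves the next glyph further along, so a large negative number often stands in for a word gap. The sketch below joins the strings and inserts a space for such gaps. `TJElement`, `joinTJ` and the `-200` threshold are assumptions made for illustration, not part of PoDoFo, and the right threshold depends on the font and the producer of the PDF.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one element of a TJ array: either a string to show or a
// numeric position adjustment. Hypothetical type, not PoDoFo's PdfArray.
struct TJElement {
    bool isString;
    std::string text;  // used when isString is true
    double adjust;     // used when isString is false
};

// Join the strings of a TJ array. Adjustments at or below gapThreshold
// (an assumed heuristic value) are treated as a word gap.
std::string joinTJ(const std::vector<TJElement>& elems,
                   double gapThreshold = -200.0) {
    std::string out;
    for (const TJElement& e : elems) {
        if (e.isString) {
            out += e.text;
        } else if (e.adjust <= gapThreshold && !out.empty() &&
                   out.back() != ' ') {
            out += ' ';  // large shift to the right: probably a space
        }
    }
    return out;
}
```

Small adjustments are ordinary kerning and are ignored, which is why "Hel", -20, "lo" comes out as one word.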

The other commands are used to determine when to introduce a line break. " and ' also introduce line breaks. Your best bet is to download the PDF spec from Adobe and look up the text processing section. It explains what each command does in more detail.
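As a rough illustration of the line-break rule, the helper below decides whether an operator should end the current line of extracted text. T*, ' and " always move to the next line; Td and TD are treated as a line break only when their vertical operand is non-zero, which is a simplifying assumption rather than something the PDF specification states. The name `startsNewLine` is hypothetical, and Tm is deliberately left out because it sets an absolute position that has to be compared with the previous one.

```cpp
#include <cassert>
#include <string>

// Decide whether an operator should emit a line break in extracted text.
// `ty` is the vertical operand of Td/TD and is ignored for other operators.
bool startsNewLine(const std::string& op, double ty = 0.0) {
    if (op == "T*" || op == "'" || op == "\"") {
        return true;  // these always move to the next line
    }
    if (op == "Td" || op == "TD") {
        return ty != 0.0;  // assumed heuristic: a vertical move is a new line
    }
    return false;
}
```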

I found it very helpful to write a small program which takes a PDF file and dumps out the command stream for each page.
