PoDoFo: Extract text + coords from a PDF

Problem description

I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates) from a PDF, but I have not found a way to do it.

This is what I have so far:

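#include <iostream>
#include <podofo/podofo.h>
using namespace PoDoFo;
using namespace std;

int main( int argc, char* argv[] )
{
    const char* filename = "hello.pdf";

    try {
        // The parser fills this container with the indirect objects of the file.
        PdfVecObjects objects;
        PdfParser parser(&objects);
        parser.ParseFile(filename);

        // Walk the objects without removing them; erasing from the
        // container while iterating would invalidate the iterator.
        for (TIVecObjects it = objects.begin(); it != objects.end(); ++it) {
            const PdfObject* obj = *it;
            // THIS IS MY PROBLEM VVVVVVVVVV
            cout << obj->Reference().ToString() << endl;
        }
    } catch (const PdfError& e) {
        // PoDoFo reports parse failures by throwing PdfError.
        e.PrintErrorMsg();
        return 1;
    }

    return 0;
}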

However, this only gives me incredibly basic information (it seems to be just the object numbers):

DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
1 0 R
2 0 R
3 0 R
4 0 R
5 0 R
6 0 R
7 0 R
8 0 R
9 0 R
10 0 R
11 0 R

I want to print out the coordinates of each object and whether it is a line or text. If it is text, I would also like to print out the text itself. Does anyone who knows this library better than I do know how I could do this?

Solution

This answer will show you how to extract the text.

To get text positioning information, you will also have to process the following commands:

Tc, Tw, Tz, TL, T*, Tr and Tm.
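As a rough illustration of what processing these commands involves, the sketch below keeps the text-state parameters they modify and updates them as each operator is seen. It does not use PoDoFo: `TextState`, `applyTextOperator` and `exampleNextLine` are hypothetical names, and the sketch assumes the operands have already been parsed into numbers by whatever tokenizer you use. The behaviour follows the text chapter of the PDF specification (for example, `T*` has the same effect as `0 -TL Td`). Other text operators such as Td, TD and Tf are not handled here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Text-state parameters changed by the operators listed above.
// All names here are illustrative; none of them are PoDoFo types.
struct TextState {
    double charSpacing = 0.0;  // Tc
    double wordSpacing = 0.0;  // Tw
    double horizScale  = 1.0;  // Tz, stored as a fraction (100 -> 1.0)
    double leading     = 0.0;  // TL
    int    renderMode  = 0;    // Tr
    // Text matrix (Tm) and text line matrix, both as [a b c d e f].
    std::array<double, 6> textMatrix{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};
    std::array<double, 6> lineMatrix{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};
};

// Applies one operator with its already-parsed numeric operands.
// Returns false for operators (or operand counts) this sketch does not handle.
bool applyTextOperator(TextState& s, const std::string& op,
                       const std::vector<double>& args)
{
    const std::size_t n = args.size();
    if      (op == "Tc" && n == 1) s.charSpacing = args[0];
    else if (op == "Tw" && n == 1) s.wordSpacing = args[0];
    else if (op == "Tz" && n == 1) s.horizScale  = args[0] / 100.0;
    else if (op == "TL" && n == 1) s.leading     = args[0];
    else if (op == "Tr" && n == 1) s.renderMode  = static_cast<int>(args[0]);
    else if (op == "Tm" && n == 6) {
        // Tm replaces both the text matrix and the text line matrix.
        for (std::size_t i = 0; i < 6; ++i) s.textMatrix[i] = args[i];
        s.lineMatrix = s.textMatrix;
    } else if (op == "T*" && n == 0) {
        // Same effect as "0 -TL Td": shift the line matrix down by the
        // leading (in text space), then restart the text matrix from it.
        s.lineMatrix[4] -= s.leading * s.lineMatrix[2];
        s.lineMatrix[5] -= s.leading * s.lineMatrix[3];
        s.textMatrix = s.lineMatrix;
    } else {
        return false;
    }
    return true;
}

// Usage: start a line at (72, 700) with a leading of 12, then move down one line.
void exampleNextLine()
{
    TextState s;
    applyTextOperator(s, "TL", {12.0});
    applyTextOperator(s, "Tm", {1.0, 0.0, 0.0, 1.0, 72.0, 700.0});
    applyTextOperator(s, "T*", {});
    assert(s.textMatrix[4] == 72.0 && s.textMatrix[5] == 688.0);
}
```

Returning `false` for anything unrecognised makes it easy to log which operators a given document uses that you have not implemented yet.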

You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.

You will also need a bit of linear algebra, mostly multiplying transformation matrices and mapping points through them. Nothing too complicated, though.
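A minimal sketch of that arithmetic, using the convention from the PDF specification in which a transformation matrix is written [a b c d e f] and points are treated as row vectors. `Matrix`, `Point`, `multiply`, `transform`, `textOrigin` and `exampleTextOrigin` are names made up for this example and are not part of PoDoFo:

```cpp
#include <cassert>

// A PDF transformation matrix [a b c d e f], standing for
// | a b 0 |
// | c d 0 |
// | e f 1 |
struct Matrix {
    double a, b, c, d, e, f;
};

struct Point {
    double x, y;
};

// Returns lhs x rhs: the transform that applies lhs first, then rhs.
Matrix multiply(const Matrix& lhs, const Matrix& rhs)
{
    return Matrix{
        lhs.a * rhs.a + lhs.b * rhs.c,
        lhs.a * rhs.b + lhs.b * rhs.d,
        lhs.c * rhs.a + lhs.d * rhs.c,
        lhs.c * rhs.b + lhs.d * rhs.d,
        lhs.e * rhs.a + lhs.f * rhs.c + rhs.e,
        lhs.e * rhs.b + lhs.f * rhs.d + rhs.f
    };
}

// Maps a point through m using the row-vector convention [x y 1] x m.
Point transform(const Point& p, const Matrix& m)
{
    return Point{
        m.a * p.x + m.c * p.y + m.e,
        m.b * p.x + m.d * p.y + m.f
    };
}

// Where the current text origin lands: the origin of text space mapped
// through the text matrix and then the current transformation matrix.
Point textOrigin(const Matrix& textMatrix, const Matrix& ctm)
{
    return transform(Point{0.0, 0.0}, multiply(textMatrix, ctm));
}

// Usage: text placed at (72, 700) under a CTM that scales by 2 and
// then shifts by (10, 20).
void exampleTextOrigin()
{
    const Matrix tm  = {1.0, 0.0, 0.0, 1.0, 72.0, 700.0};
    const Matrix ctm = {2.0, 0.0, 0.0, 2.0, 10.0, 20.0};
    const Point p = textOrigin(tm, ctm);
    assert(p.x == 154.0 && p.y == 1420.0);
}
```

The order of multiplication matters: with row vectors, the matrix on the left is applied first, so the text matrix goes on the left of the CTM.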

Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example: I ran across a document which set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.
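To make the one-point example concrete: the size at which text actually appears is the size given to Tf multiplied by the scale of the matrices in effect. The sketch below assumes the scaling comes from a matrix such as Tm (or Tm combined with the CTM). The answer does not say which mechanism that particular document used, so treat this as one plausible case. `Matrix`, `effectiveFontSize`, `effectiveHorizontalScale` and `exampleOnePointFont` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// A PDF transformation matrix [a b c d e f].
struct Matrix {
    double a, b, c, d, e, f;
};

// Rendered glyph height for a given Tf size under matrix m.
// The vertical unit vector (0, 1) maps to (c, d), so the length of
// (c, d) is the vertical scale factor.
double effectiveFontSize(double tfSize, const Matrix& m)
{
    return tfSize * std::hypot(m.c, m.d);
}

// Rendered horizontal scale, which additionally includes the Tz
// horizontal scaling (passed as a fraction, so 100% is 1.0).
double effectiveHorizontalScale(double tfSize, double horizScale, const Matrix& m)
{
    return tfSize * horizScale * std::hypot(m.a, m.b);
}

// Usage: a font selected at size 1 but drawn under a matrix that
// scales by 12 ends up 12 units tall.
void exampleOnePointFont()
{
    const Matrix tm = {12.0, 0.0, 0.0, 12.0, 72.0, 700.0};
    const double size = effectiveFontSize(1.0, tm);
    assert(std::fabs(size - 12.0) < 1e-9);
}
```

Using the vector length rather than reading `d` directly keeps the result correct for rotated text, where `d` alone can be zero.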
