Getting text position while parsing PDF with Quartz 2D


Question

Another question regarding PDF parsing. I just read section 5.3.1, "Text-Positioning Operators", of the PDF Reference, version 1.7, and I am a little confused.

I wrote some code to get the transformation matrix and the initial text position.

    CGPDFOperatorTableSetCallback (table, "MP", &op_MP);//Define marked-content point
    CGPDFOperatorTableSetCallback (table, "DP", &op_DP);//Define marked-content point with property list
    CGPDFOperatorTableSetCallback (table, "BMC", &op_BMC);//Begin marked-content sequence
    CGPDFOperatorTableSetCallback (table, "BDC", &op_BDC);//Begin marked-content sequence with property list
    CGPDFOperatorTableSetCallback (table, "EMC", &op_EMC);//End marked-content sequence

    //Text State operators
    CGPDFOperatorTableSetCallback(table, "Tc", &op_Tc);
    CGPDFOperatorTableSetCallback(table, "Tw", &op_Tw);
    CGPDFOperatorTableSetCallback(table, "Tz", &op_Tz);
    CGPDFOperatorTableSetCallback(table, "TL", &op_TL);
    CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf);
    CGPDFOperatorTableSetCallback(table, "Tr", &op_Tr);
    CGPDFOperatorTableSetCallback(table, "Ts", &op_Ts);

    //text showing operators
    CGPDFOperatorTableSetCallback(table, "TJ", &op_TJ);
    CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj);
    CGPDFOperatorTableSetCallback(table, "'", &op_apostrof);
    CGPDFOperatorTableSetCallback(table, "\"", &op_double_apostrof);

    //text positioning operators        
    CGPDFOperatorTableSetCallback(table, "Td", &op_Td);
    CGPDFOperatorTableSetCallback(table, "TD", &op_TD);
    CGPDFOperatorTableSetCallback(table, "Tm", &op_Tm);
    CGPDFOperatorTableSetCallback(table, "T*", &op_T);

    //text object operators
    CGPDFOperatorTableSetCallback(table, "BT", &op_BT);//Begin text object
    CGPDFOperatorTableSetCallback(table, "ET", &op_ET);//End text object

This is the output after launching the application:

    2010-09-02 15:09:23.041 testSearch[8251:207] op_BT begin
    Integer value: 0
    2010-09-02 15:09:23.043 testSearch[8251:207] op_BT end
    2010-09-02 15:09:23.043 testSearch[8251:207] op_Tf begin
    Integer value: 1
    2010-09-02 15:09:23.044 testSearch[8251:207] op_Tf end
    2010-09-02 15:09:23.044 testSearch[8251:207] op_Tm begin
    Float value: 557.364197
    2010-09-02 15:09:23.045 testSearch[8251:207] op_Tm end
    2010-09-02 15:09:23.045 testSearch[8251:207] op_TJ begin
    2010-09-02 15:09:23.046 testSearch[8251:207] Array string value [0]: F
    2010-09-02 15:09:23.046 testSearch[8251:207] Array integer value [1]: 94985208
    2010-09-02 15:09:23.047 testSearch[8251:207] Array string value [2]: r
    2010-09-02 15:09:23.047 testSearch[8251:207] Array integer value [3]: 94985208
    2010-09-02 15:09:23.048 testSearch[8251:207] Array string value [4]: o
    2010-09-02 15:09:23.048 testSearch[8251:207] Array integer value [5]: 94985208
    2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [6]: m s
    2010-09-02 15:09:23.049 testSearch[8251:207] Array integer value [7]: 94985208
    2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [8]: a
    2010-09-02 15:09:23.050 testSearch[8251:207] Array integer value [9]: 94985208
    2010-09-02 15:09:23.050 testSearch[8251:207] Array string value [10]: m
    2010-09-02 15:09:23.051 testSearch[8251:207] Array integer value [11]: 94985208
    2010-09-02 15:09:23.051 testSearch[8251:207] Array string value [12]: p
    2010-09-02 15:09:23.052 testSearch[8251:207] Array integer value [13]: 94985208
    2010-09-02 15:09:23.053 testSearch[8251:207] Array string value [14]: l
    2010-09-02 15:09:23.054 testSearch[8251:207] Array integer value [15]: 94985208
    2010-09-02 15:09:23.055 testSearch[8251:207] Array string value [16]: e t
    2010-09-02 15:09:23.055 testSearch[8251:207] Array integer value [17]: 94985208
    2010-09-02 15:09:23.057 testSearch[8251:207] Array string value [18]: o r
    2010-09-02 15:09:23.057 testSearch[8251:207] Array integer value [19]: 94985208
    2010-09-02 15:09:23.058 testSearch[8251:207] Array string value [20]: e
    2010-09-02 15:09:23.058 testSearch[8251:207] Array integer value [21]: 94985208
    2010-09-02 15:09:23.059 testSearch[8251:207] Array string value [22]: s
    2010-09-02 15:09:23.059 testSearch[8251:207] Array integer value [23]: 94985208
    2010-09-02 15:09:23.060 testSearch[8251:207] Array string value [24]: u
    2010-09-02 15:09:23.061 testSearch[8251:207] Array integer value [25]: 94985208
    2010-09-02 15:09:23.061 testSearch[8251:207] Array string value [26]: l
    2010-09-02 15:09:23.062 testSearch[8251:207] Array integer value [27]: 94985208
    2010-09-02 15:09:23.062 testSearch[8251:207] Array string value [28]: t
    2010-09-02 15:09:23.063 testSearch[8251:207] op_TJ end
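
The log shows a single float for `Tm`, although the operator takes six operands (`a b c d e f`). A scanner hands operands back from a stack, so the last operand written in the content stream is the first one popped; the value printed above is therefore probably `f`, the vertical translation. The sketch below is a toy model in plain C, not the CGPDF API: `OperandStack`, `stack_push`, `stack_pop`, and `pop_tm` are hypothetical names used only to illustrate the pop order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPERANDS 16

/* Toy operand stack standing in for the scanner's stack (hypothetical). */
typedef struct {
    double values[MAX_OPERANDS];
    size_t count;
} OperandStack;

static void stack_push(OperandStack *s, double v) {
    assert(s->count < MAX_OPERANDS);
    s->values[s->count++] = v;
}

/* Returns 1 and stores the top value in *out, or 0 if the stack is empty. */
static int stack_pop(OperandStack *s, double *out) {
    if (s->count == 0)
        return 0;
    *out = s->values[--s->count];
    return 1;
}

/* Pops the six Tm operands so that m[0..5] ends up as a b c d e f.
   The first pop yields f, so the array is filled from the back. */
static int pop_tm(OperandStack *s, double m[6]) {
    for (int i = 5; i >= 0; --i) {
        if (!stack_pop(s, &m[i]))
            return 0;
    }
    return 1;
}
```

With Quartz, the same idea would mean popping six numbers in a row inside the `Tm` callback and storing them in reverse.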

If someone is familiar with the text matrix and the text-positioning operators, it would be nice to have an explanation of how all of these things work.

How do I calculate the text (or glyph?) position using Tm (the transformation matrix) and the other data?
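
As a rough illustration of section 5.3.1, the positioning operators can be modeled as updates to two matrices: the text matrix and the text line matrix. The sketch below is plain C and independent of Quartz; `Matrix`, `TextState`, and the `text_*` helpers are hypothetical names, not part of the CGPDF API, and the CTM is assumed to be tracked separately and passed in by the caller.

```c
/* Affine matrix in PDF operand order: [a b c d e f]. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Hypothetical text state; field names are illustrative only. */
typedef struct {
    Matrix tm;       /* text matrix */
    Matrix tlm;      /* text line matrix */
    double leading;  /* TL */
} TextState;

static const Matrix kIdentity = {1, 0, 0, 1, 0, 0};

/* Returns m x n (PDF row-vector convention: m is applied first, then n). */
static Matrix matrix_concat(Matrix m, Matrix n) {
    Matrix r;
    r.a = m.a * n.a + m.b * n.c;
    r.b = m.a * n.b + m.b * n.d;
    r.c = m.c * n.a + m.d * n.c;
    r.d = m.c * n.b + m.d * n.d;
    r.e = m.e * n.a + m.f * n.c + n.e;
    r.f = m.e * n.b + m.f * n.d + n.f;
    return r;
}

/* BT: both matrices start as the identity. */
static void text_begin(TextState *ts) {
    ts->tm = kIdentity;
    ts->tlm = kIdentity;
}

/* Tm: replaces both matrices (it does not concatenate). */
static void text_set_matrix(TextState *ts, Matrix m) {
    ts->tm = m;
    ts->tlm = m;
}

/* Td: translates relative to the start of the current line. */
static void text_move(TextState *ts, double tx, double ty) {
    Matrix t = {1, 0, 0, 1, tx, ty};
    ts->tlm = matrix_concat(t, ts->tlm);
    ts->tm = ts->tlm;
}

/* TD: same as Td, but also sets the leading to -ty. */
static void text_move_set_leading(TextState *ts, double tx, double ty) {
    ts->leading = -ty;
    text_move(ts, tx, ty);
}

/* T*: moves down one line using the current leading. */
static void text_next_line(TextState *ts) {
    text_move(ts, 0, -ts->leading);
}

/* Maps the text-space origin through Tm and then the CTM. */
static void text_origin(const TextState *ts, Matrix ctm, double *x, double *y) {
    Matrix m = matrix_concat(ts->tm, ctm);
    *x = m.e;
    *y = m.f;
}
```

For example, after `1 0 0 1 72 557.25 Tm` followed by `0 -14 Td`, `text_origin` with an identity CTM reports (72, 543.25).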

Recommended answer

@Koteg: Hi! Did you finally manage to get it working? For Tm, I am able to get all six values, but so far I can't see how to get the position of a word within a line. I have an idea: for Tj, take the spacing between letters (hoping it is the same every time) and combine it with Tm to get the position of a word. For TJ, it is more complicated: each number in the array is a horizontal translation to subtract from the Tm matrix before the next part of the array is drawn, and searching for a word in that array will be harder than for Tj.
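
The idea above can be sketched with the displacement formula from the PDF Reference, tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th. The C code below is only an illustration under stated assumptions: `TextParams`, `TJElement`, `glyph_width`, `string_advance`, and `tj_offsets` are hypothetical names, and `glyph_width` returns a constant 500 where real code would look up the font's width table, since glyph widths are generally not the same for every letter.

```c
#include <stddef.h>

/* Text-state parameters that affect horizontal advance. */
typedef struct {
    double font_size;     /* Tfs */
    double char_spacing;  /* Tc */
    double word_spacing;  /* Tw, applied to the space character (code 32) */
    double h_scale;       /* Th as a fraction, 1.0 means 100% */
} TextParams;

/* One TJ array element: a string, or a numeric adjustment when text is NULL. */
typedef struct {
    const char *text;
    double adjust;  /* thousandths of a text-space unit */
} TJElement;

/* ASSUMPTION: constant width in 1/1000 units; a real parser reads the font. */
static double glyph_width(unsigned char code) {
    (void)code;
    return 500.0;
}

/* Advance, in text space, produced by showing one string. */
static double string_advance(const char *s, const TextParams *p) {
    double tx = 0.0;
    for (; *s != '\0'; ++s) {
        double w0 = glyph_width((unsigned char)*s) / 1000.0;
        double tw = (*s == ' ') ? p->word_spacing : 0.0;
        tx += (w0 * p->font_size + p->char_spacing + tw) * p->h_scale;
    }
    return tx;
}

/* Stores in offsets[i] the text-space x at which element i starts,
   and returns the total advance of the whole TJ array. */
static double tj_offsets(const TJElement *els, size_t n,
                         const TextParams *p, double *offsets) {
    double x = 0.0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = x;
        if (els[i].text != NULL)
            x += string_advance(els[i].text, p);
        else
            x -= els[i].adjust / 1000.0 * p->font_size * p->h_scale;
    }
    return x;
}
```

The offsets are in text space. A point at offset x on the baseline maps to user space through Tm as (a·x + e, b·x + f), which gives the position of each string, and from there of a word inside it.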

BTW, for other people:

// A TJ array can mix strings and numeric adjustments in any order,
// so test each element's type instead of assuming they alternate.
size_t count = CGPDFArrayGetCount(array);
for(size_t n = 0; n < count; n++)
{
    CGPDFStringRef string;
    CGPDFReal real;

    if(CGPDFArrayGetString(array, n, &string))
    {
        // Pre-ARC code: the copied string is owned here and released below.
        NSString *data = (NSString *)CGPDFStringCopyTextString(string);
        NSLog(@"array data : %@", data);

        [searcher.currentData appendFormat:@"%@", data];
        [data release];
    }
    else if(CGPDFArrayGetNumber(array, n, &real))
    {
        // Adjustment in thousandths of a text-space unit.
        NSLog(@"array real : %f", real);
    }
}

Thanks
