CGPDFScannerPopString returning strange result

Question

I finally got some sort of PDF scanner to work. It reaches the callback functions without a problem, but when I try to NSLog the result of CGPDFScannerPopString I get output like this:

ˆ ˛˝     #    ˜˜˜      #˜'  ˜˜˜      "˜   '˜˜      " '   ˜˜

No string to be found here...

Any ideas of what it can be? This is my callback function:

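static void op_Tj (CGPDFScannerRef s, void *info)
{
    CGPDFStringRef string;

    // Tj takes a single string operand; bail out if none is on the stack.
    if (!CGPDFScannerPopString(s, &string))
        return;

    // CGPDFStringCopyTextString returns a +1 CFStringRef, so hand ownership to ARC.
    NSString *text = CFBridgingRelease(CGPDFStringCopyTextString(string));
    NSLog(@"string: %@", text);
}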

Thanks a lot!

EDIT: sample PDF

Recommended answer

You should be aware that a CGPDFStringRef is not an ASCII string or anything similar. Cf. http://developer.apple.com/library/mac/documentation/graphicsimaging/Reference/CGPDFString/Reference/reference.html: it is a "series of bytes", unsigned integer values in the range 0 to 255, which have to be interpreted according to the latest PDF reference.

The PDF reference in turn will tell you that the interpretation of the bytes depends on the font used. ASCII-like interpretations are common for European languages, but they are not mandatory, and for Asian languages, where font subset embedding is very common, the interpretation may look random.

CGPDFStringCopyTextString tries to interpret those bytes accordingly, but there does not have to be a sensible interpretation as a regular string.
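
To see what such a string really holds, it can help to print the raw bytes instead of a text interpretation. The sketch below is plain C with a hypothetical helper name, hex_dump; in the Tj callback the pointer and length would presumably come from CGPDFStringGetBytePtr and CGPDFStringGetLength, and the sample bytes mentioned afterwards are made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes as space-separated hex pairs
   into out (always NUL-terminated when out_size > 0).
   Returns the number of input bytes that fit into the buffer. */
size_t hex_dump(const unsigned char *bytes, size_t len,
                char *out, size_t out_size)
{
    size_t i, pos = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < len; i++) {
        /* two hex digits, a separator after the first byte, and the NUL */
        size_t need = (i ? 3 : 2) + 1;
        if (pos + need > out_size)
            break;
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                i ? " %02X" : "%02X", (unsigned)bytes[i]);
    }
    return i;
}
```

For the three made-up bytes 0x01 0x02 0x1F the buffer would hold 01 02 1F, none of which is a printable ASCII character.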

EDIT: Inspection of the sample PDF Ron supplied showed that, in this sample, the encoding of the font in object 3 0 (which is dominant on most pages of the document) is indeed not a standard encoding but instead:

<</Type/Encoding
  /Differences[0/.notdef/C/O/V/E/R/space/slash/H/L/F/underscore/W/B/five/eight/four
                /zero/two/six/D/one/period/three/Z/I/N/G/U/S/T/colon/seven/A/M/P/Y
                /plus/nine/X/hyphen/i/s/p/a/t/c/h/n/f/o/K/greater/equal/l/m/y/J/Q
                /parenleft/parenright/comma/dollar/ampersand/d/r/v/b/e/u/w/k/g/x/bar
                /quotesingle/asterisk/q/question/percent]
>>

Looking at the top of the first document page

COVER / HLF_CWEB_58408485 / 58408485 / 26DEC12 10.30.22Z


BRIEFING INCLUDES FOLLOWING FLIGHTS:

26DEC12 OR0337 EHAM0630 MUVR1710 PHOYE VSM+2/8 179

NEXT FLIGHTS OF AIRCRAFT:

26DEC12 OR0338 MUVR1830 MMUN1940 PHOYE VSM+2/8 213
26DEC12 OR0338 MMUN2105 EHAM0655 PHOYE GPT+2/7 263
27DEC12 OR0365 EHAM0900 TNCB1930 PHOYE BAH+1/8 272
27DEC12 OR0366 TNCB2030 TNCC2110 PHOYE BAH+1/8 250
27DEC12 OR0366 TNCC2250 EHAM0835 PHOYE ASD+1/8 199 

that encoding seems to have been created by assigning consecutive numbers, starting from one, to each glyph as it was first needed. This obviously results in a highly individualistic encoding...
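
The numbering pattern can be reproduced with a few lines of C. This is only a sketch of the guess made above, not a statement about how the PDF producer actually works; assign_codes is a hypothetical name, and the function works on plain characters rather than glyph names.

```c
#include <stddef.h>

/* Assign codes 1, 2, 3, ... to characters in order of first appearance.
   codes[c] receives the code for character c (0 means "not assigned").
   Returns the number of distinct characters that received a code. */
int assign_codes(const char *text, unsigned char codes[256])
{
    int next = 1;
    size_t i;

    for (i = 0; i < 256; i++)
        codes[i] = 0;

    for (i = 0; text[i] != '\0'; i++) {
        unsigned char c = (unsigned char)text[i];
        /* only characters not seen before get the next free code */
        if (codes[c] == 0 && next < 256)
            codes[c] = (unsigned char)next++;
    }
    return next - 1;
}
```

Run over the first line of the page, COVER / HLF_CWEB_58408485 / 58408485 / 26DEC12 10.30.22Z, this gives C = 1, O = 2, V = 3, E = 4, R = 5, space = 6, slash = 7 and so on up to Z = 24, which matches the order of the glyph names in the /Differences array.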

That being said, the font object does include both an /Encoding entry and a /ToUnicode entry. Thus, if CGPDFStringCopyTextString were given a reference to the font here and really tried, it could easily translate those bytes into the corresponding text. That it does not produce anything decent seems to indicate that it simply does not know which font to interpret the bytes for; I don't assume it doesn't try...

For accurate text extraction, therefore, you have to interpret the bytes in the CGPDFStringRef yourself, using the information of the font that is current in the content stream. If you don't want to do that from scratch, you might be interested in PDFKitten, a framework for extracting data from PDFs in iOS. While it is not yet perfect (some font structures can baffle it), it is a good starting point.
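
As a rough illustration of interpreting the bytes yourself, the following C sketch decodes single-byte codes with a hand-written table. Everything in it is an assumption made for the example: the names kCodeToChar and decode_pdf_bytes are hypothetical, the table covers only codes 0 to 24 of the /Differences array shown earlier with the glyph names replaced by hand with ASCII characters, and the bytes are assumed to have been fetched with CGPDFStringGetBytePtr and CGPDFStringGetLength. A real extractor would build the table from the /ToUnicode or /Encoding entry of whichever font is selected when the string is shown.

```c
#include <stddef.h>

/* Codes 0..24 of the /Differences array, glyph names mapped by hand to
   ASCII (assumption: /space -> ' ', /slash -> '/', /five -> '5', ...).
   Code 0 is /.notdef and is rendered as '?'. */
static const char kCodeToChar[] = "?COVER /HLF_WB584026D1.3Z";

/* Translate single-byte codes into text using the table above.
   Codes outside the table become '?'. The output is always
   NUL-terminated when out_size > 0. Returns the characters written. */
size_t decode_pdf_bytes(const unsigned char *bytes, size_t len,
                        char *out, size_t out_size)
{
    const size_t table_len = sizeof kCodeToChar - 1; /* skip the NUL */
    size_t i, n = 0;

    if (out_size == 0)
        return 0;

    for (i = 0; i < len && n + 1 < out_size; i++) {
        unsigned char code = bytes[i];
        out[n++] = (code < table_len) ? kCodeToChar[code] : '?';
    }
    out[n] = '\0';
    return n;
}
```

With this table the bytes 01 02 03 04 05 06 07 06 08 09 0A decode to COVER / HLF. The sketch deliberately ignores multi-byte codes, which composite fonts use, so it is a starting point for simple fonts only.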
