How to get text from Identity-H encoded fonts in a PDF

Problem description

I have succeeded in extracting text from a PDF using callbacks for the TJ and Tj operators, but some text is still missing: the text that is Identity-H encoded. How can I convert it to text (an NSString)?

Recommended answer

Identity-H encoding implies a Type0 font (also known as a CID-keyed font), so you have to consult the font's embedded ToUnicode mapping. The strings you receive from TJ, Tj, ' (single quote) and " (double quote), the four text-showing operators, are not Unicode. They are sequences of two-byte character codes, essentially arbitrary character IDs that have little meaning outside the current font.
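
As a rough illustration of what consulting the ToUnicode mapping involves, here is a minimal Python sketch. It assumes the ToUnicode stream has already been extracted and decompressed into text, and that every character code is two bytes long, as is the case with Identity-H. The function names `parse_tounicode` and `decode_show_string` are hypothetical, and the parser only handles the common `bfchar` and `bfrange` sections, so treat it as a starting point rather than a complete CMap reader.

```python
import re

HEX = re.compile(r"<([0-9A-Fa-f]+)>")


def _utf16be(hex_str):
    """Decode a hex string of UTF-16BE code units into a Python str."""
    return bytes.fromhex(hex_str).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Build {character code: unicode string} from a ToUnicode CMap.

    Hypothetical helper: only bfchar and bfrange sections are read.
    """
    mapping = {}
    # bfchar: pairs of <source code> <destination string>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        tokens = HEX.findall(section)
        for src, dst in zip(tokens[0::2], tokens[1::2]):
            mapping[int(src, 16)] = _utf16be(dst)
    # bfrange: <lo> <hi> <dst>  or  <lo> <hi> [<dst1> <dst2> ...]
    entry = re.compile(
        r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*"
        r"(?:<([0-9A-Fa-f]+)>|\[(.*?)\])",
        re.S,
    )
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, single, array in entry.findall(section):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if single:
                # Consecutive codes map to consecutive values:
                # bump the last character of the base string.
                base = _utf16be(single)
                for i, code in enumerate(codes):
                    mapping[code] = base[:-1] + chr(ord(base[-1]) + i)
            else:
                for code, dst in zip(codes, HEX.findall(array)):
                    mapping[code] = _utf16be(dst)
    return mapping


def decode_show_string(raw, mapping):
    """Decode the bytes of a Tj/TJ string, assuming 2-byte big-endian codes."""
    out = []
    for i in range(0, len(raw) - 1, 2):
        code = (raw[i] << 8) | raw[i + 1]
        # Codes missing from the map become U+FFFD (replacement character).
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


# A made-up CMap fragment, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0044> <0046> <0061>
endbfrange
"""

table = parse_tounicode(SAMPLE_CMAP)
print(decode_show_string(bytes.fromhex("00240044004500030046"), table))  # Aab c
```

In an iOS project the same logic would presumably be applied to the raw bytes of each string operand (for example those returned by `CGPDFStringGetBytePtr` and `CGPDFStringGetLength`), with the resulting text turned into an NSString. The CMap itself would come from the `ToUnicode` entry of the current font dictionary.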

The PDF specification document is very clear, but quite a demanding read.
