PDF search on the iPhone

Question

After two days trying to read annotations from a PDF using Quartz, I've managed to do it and posted my code.

Now I'd like to do the same for another frequently asked question: searching PDF documents with Quartz. The situation is the same as before: this question has been asked many times with almost no practical answers. So I need some pointers first, as I still haven't implemented this myself.

What I tried:

I tried using CGPDFScannerScan, handling the TJ and Tj operators. It returns the right text on some PDFs, whereas on other documents it returns mostly random letters. Maybe it's related to text encoding? Someone pointed out that text blocks (marked by the BT/ET operators) should be handled instead, but I still haven't managed to do so. Has anyone managed to extract text from any PDF?

After that, searching should be easy by storing all the text in an NSMutableString and using rangeOfString (if there's a better way please let me know).

But then how to highlight the result? I know there are a few operators to find the glyph sizes, so I could calculate the resulting rect based on those values, but I've been reading the spec for hours... it's a bloated mess and I'm going insane. Anyone with a practical explanation?

User Naveen Thunga found PDFKitten, "a framework for extracting data from PDFs in iOS". I just tried the demo and it seems to work as advertised. I will test it with more PDFs and will post the results soon. As a side note, the code seems very good to me -- if you are interested in how this stuff works it's pretty awesome.

Recommended answer

This isn't a simple problem to implement, but it is straightforward.

For any given page you need to scan the page using the CGPDF scanner API. You need to register callbacks for the PDF operators that affect text in the page: not just TJ/Tj, but also those that set the font, affect the text drawing matrix, etc. You need to build a state machine that updates with each operator encountered and its operands. You need to interpret the text according to the current font's encoding. When you find text that you want to highlight, you'll need to examine the current text drawing matrix you've been updating to determine the drawing coordinates. Read the PDF specification (version 1.7 is downloadable from Adobe) to understand which operators you need to pay attention to.
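
To make the state machine idea concrete, here is a minimal sketch in portable C of the bookkeeping for a few text operators (BT, ET, Tf, Td, Tm, Tj). It deliberately does not call CoreGraphics: in a real app each op_* function would be driven by a callback registered on a CGPDFOperatorTable, with operands popped from the scanner. The names TextState, Box, op_* and the widths table are hypothetical and only illustrate the idea. The sketch assumes an identity CTM and no rotation or skew, ignores character spacing, word spacing, horizontal scaling and text rise, and uses the font size as a rough glyph height.

```c
#include <assert.h>
#include <stddef.h>

/* PDF matrix [a b c d e f]: x' = a*x + c*y + e, y' = b*x + d*y + f. */
typedef struct { double a, b, c, d, e, f; } Matrix;

typedef struct {
    Matrix tm;         /* text matrix, advanced as glyphs are shown */
    Matrix tlm;        /* text line matrix, start of the current line */
    double font_size;  /* size operand of the last Tf */
    int in_text;       /* nonzero between BT and ET */
} TextState;

/* Approximate rectangle of one shown string (assumption: no rotation). */
typedef struct { double x, y, width, height; } Box;

static const Matrix kIdentity = {1, 0, 0, 1, 0, 0};

/* Returns m1 x m2, i.e. apply m1 first and then m2. */
static Matrix matrix_concat(Matrix m1, Matrix m2) {
    Matrix r;
    r.a = m1.a * m2.a + m1.b * m2.c;
    r.b = m1.a * m2.b + m1.b * m2.d;
    r.c = m1.c * m2.a + m1.d * m2.c;
    r.d = m1.c * m2.b + m1.d * m2.d;
    r.e = m1.e * m2.a + m1.f * m2.c + m2.e;
    r.f = m1.e * m2.b + m1.f * m2.d + m2.f;
    return r;
}

/* BT: begin a text object, resetting both matrices. */
void op_BT(TextState *s) { s->tm = s->tlm = kIdentity; s->in_text = 1; }

/* ET: end the text object. */
void op_ET(TextState *s) { s->in_text = 0; }

/* Tf: only the size is tracked here; a real scanner also looks up the font. */
void op_Tf(TextState *s, double size) { s->font_size = size; }

/* Td: move to the start of the next line, offset by (tx, ty). */
void op_Td(TextState *s, double tx, double ty) {
    Matrix t = {1, 0, 0, 1, tx, ty};
    s->tlm = matrix_concat(t, s->tlm);
    s->tm = s->tlm;
}

/* Tm: replace the text matrix and text line matrix outright. */
void op_Tm(TextState *s, Matrix m) { s->tm = s->tlm = m; }

/* Tj: compute where the string lands, then advance the text matrix.
   widths[] holds glyph widths in 1/1000 text-space units, indexed by byte. */
Box op_Tj(TextState *s, const unsigned char *bytes, size_t len,
          const double widths[256]) {
    assert(s->in_text);  /* Tj is only valid inside BT ... ET */
    double advance = 0.0;
    for (size_t i = 0; i < len; i++) {
        advance += widths[bytes[i]] / 1000.0 * s->font_size;
    }
    Box box = { s->tm.e, s->tm.f, advance * s->tm.a, s->font_size * s->tm.d };
    Matrix t = {1, 0, 0, 1, advance, 0};
    s->tm = matrix_concat(t, s->tm);
    return box;
}
```

For example, with every glyph 500 units wide, the sequence BT, Tf 12, Td 72 700, Tj "abc" yields a box starting at (72, 700) that is 18 units wide, and the next Tj starts at x = 90. The TJ operator could be handled in the same way by treating each string element like Tj and each number as a negative horizontal adjustment.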

Font encoding is perhaps the most difficult part since there are a handful of ways encoding can be specified, and some of them are proprietary to the font. Mostly you can cheat and fall back on a subset of ANSI encoding - but this WILL break on certain PDFs having strange fonts.
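
As a rough illustration of the "cheat" described above, the following C sketch decodes single-byte strings through a 256-entry table. The fallback maps only the codes where WinAnsiEncoding and ISO Latin-1 largely coincide, and individual codes can be overridden, which is roughly what a font's Differences array does. The names Encoding, encoding_init_fallback, encoding_apply_difference and decode_to_utf8 are hypothetical. It is also a simplifying assumption that the caller supplies Unicode code points directly: a real Differences array lists glyph names, which need a separate glyph-name lookup, and fonts with a ToUnicode CMap or multi-byte encodings are not covered at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Maps each single-byte character code to a Unicode code point. */
typedef struct { uint32_t to_unicode[256]; } Encoding;

/* Fallback table: printable ASCII and 0xA0..0xFF map to themselves,
   everything else becomes U+FFFD (replacement character). */
void encoding_init_fallback(Encoding *enc) {
    for (int c = 0; c < 256; c++) {
        int shared = (c >= 0x20 && c <= 0x7E) || c >= 0xA0;
        enc->to_unicode[c] = shared ? (uint32_t)c : 0xFFFDu;
    }
}

/* Override one code, as an entry of a font's Differences array would. */
void encoding_apply_difference(Encoding *enc, unsigned char code, uint32_t cp) {
    enc->to_unicode[code] = cp;
}

/* Encodes one code point as UTF-8 and returns the byte count (1 to 4). */
static size_t utf8_encode(uint32_t cp, char out[4]) {
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes a PDF string into NUL-terminated UTF-8. Stops early if the
   buffer is too small. Returns bytes written, excluding the NUL. */
size_t decode_to_utf8(const Encoding *enc, const unsigned char *bytes,
                      size_t len, char *out, size_t cap) {
    assert(cap > 0);  /* room for at least the terminating NUL */
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        char buf[4];
        size_t k = utf8_encode(enc->to_unicode[bytes[i]], buf);
        if (n + k + 1 > cap) break;
        memcpy(out + n, buf, k);
        n += k;
    }
    out[n] = '\0';
    return n;
}
```

Seeing many U+FFFD characters in the output is a useful signal that the font uses an encoding the fallback does not cover, which matches the "random letters" symptom described in the question.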

Essentially you are processing the page as if you were to render it.
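
One practical consequence is that if you record where each character lands while scanning, a search hit maps straight back to page coordinates. The C sketch below assumes you have already collected one Glyph per decoded character, with its horizontal extent in page space. Glyph and find_text are hypothetical names, matching is ASCII-only and case-insensitive, and a match is assumed to lie on a single line.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* One decoded character and its horizontal extent on the page. */
typedef struct { char ch; double x0, x1; } Glyph;

/* Finds the first case-insensitive match of needle at or after start.
   Returns the index of the first matching glyph, or -1 if none.
   On success, *x0 and *x1 receive the span to highlight. */
long find_text(const Glyph *glyphs, size_t count, const char *needle,
               size_t start, double *x0, double *x1) {
    assert(glyphs != NULL && needle != NULL && x0 != NULL && x1 != NULL);
    size_t n = strlen(needle);
    if (n == 0 || n > count) return -1;
    for (size_t i = start; i + n <= count; i++) {
        size_t j = 0;
        while (j < n &&
               tolower((unsigned char)glyphs[i + j].ch) ==
               tolower((unsigned char)needle[j])) {
            j++;
        }
        if (j == n) {
            *x0 = glyphs[i].x0;
            *x1 = glyphs[i + n - 1].x1;
            return (long)i;
        }
    }
    return -1;
}
```

Calling find_text again with start set to the previous result plus one walks through all matches on the page. The vertical extent of the highlight would come from the text matrix and font size in effect when the glyphs were shown.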
