PDF search on the iPhone

Question

After two days of trying to read annotations from a PDF using Quartz, I managed to do it and posted my code.

Now I'd like to do the same for another frequently asked question: searching PDF documents with Quartz. It's the same situation as before: this question has been asked many times with almost no practical answers. I need some pointers first, as I still haven't implemented this myself.

What I've tried:

I tried using CGPDFScannerScan and handling the TJ and Tj operators. This returns the right text on some PDFs, whereas on other documents it returns mostly random letters. Maybe it's related to text encoding? Someone pointed out that text blocks (marked by the BT/ET operators) should be handled instead, but I still haven't managed to do so. Has anyone managed to extract text from any PDF?

After that, searching should be easy: store all the text in an NSMutableString and use rangeOfString (if there's a better way, please let me know).
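The search step described above can be sketched independently of Quartz. This is a minimal, language-neutral illustration in Python, not iOS code: `find_all` is a hypothetical helper name, and it assumes the page text has already been extracted into one string. It returns a (location, length) pair per match, similar in spirit to the ranges that repeated rangeOfString calls would give.

```python
import re


def find_all(haystack, needle):
    """Return (location, length) for every non-overlapping,
    case-insensitive occurrence of needle in haystack."""
    if not needle:
        return []  # an empty pattern would otherwise match everywhere
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(haystack)]


page_text = "PDF search: searching a pdf"
print(find_all(page_text, "pdf"))
```

To highlight results later, the offsets alone are not enough: each character offset would also need to be tied back to the position where that text was drawn on the page.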

But then how do I highlight the result? I know there are a few operators for finding the glyph sizes, so I could calculate the resulting rect from those values, but I've been reading the spec for hours... it's a bloated mess and I'm going insane. Does anyone have a practical explanation?

User Naveen Thunga found PDFKitten, "a framework for extracting data from PDFs in iOS". I just tried the demo and it seems to work as advertised. I will test it with more PDFs and post the results soon. As a side note, the code looks very good to me; if you are interested in how this stuff works, it's pretty awesome.

Recommended answer

This isn't a simple problem to implement, but it is straightforward.

For any given page, you need to scan the page using the CGPDF scanner API. You need to register callbacks for the PDF operators that affect text in the page: not just TJ/Tj, but also those that set the font, affect the text drawing matrix, and so on. You need to build a state machine that updates with each operator and its parameters as they are encountered. You need to examine the text taking the current font's encoding into account. When you find text that you want to highlight, you'll need to examine the current text drawing matrix you've been updating to determine the drawing coordinates. Read the PDF specification (version 1.7 is downloadable from Adobe) to understand which operators you need to pay attention to.
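The state machine described above can be sketched without any Quartz dependency. The following is a minimal, language-neutral illustration in Python, not the CGPDF API: `TextState`, `handle`, and `GLYPH_WIDTH_EM` are hypothetical names, and every glyph is assumed to be half an em wide, whereas a real implementation would read widths from the font dictionary. It covers only BT, Tf, Td, Tm, Tj and TJ, and it ignores character spacing, word spacing, horizontal scaling and the page's current transformation matrix. On iOS, each operator callback registered with CGPDFOperatorTableSetCallback would pop its operands from the scanner and then feed a structure like this.

```python
from dataclasses import dataclass, field

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
GLYPH_WIDTH_EM = 0.5  # assumption: fixed glyph width, real fonts vary


def concat(m1, m2):
    """Multiply two PDF matrices written as (a, b, c, d, e, f): m1 x m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def translation(tx, ty):
    return (1.0, 0.0, 0.0, 1.0, tx, ty)


@dataclass
class TextState:
    font: str = ""
    size: float = 0.0
    tm: tuple = IDENTITY    # text matrix
    tlm: tuple = IDENTITY   # text line matrix
    runs: list = field(default_factory=list)

    def handle(self, op, *operands):
        if op == "BT":                      # begin text: reset both matrices
            self.tm = self.tlm = IDENTITY
        elif op == "Tf":                    # select font and size
            self.font, self.size = operands
        elif op == "Td":                    # move to the start of a new line
            tx, ty = operands
            self.tlm = concat(translation(tx, ty), self.tlm)
            self.tm = self.tlm
        elif op == "Tm":                    # set both matrices explicitly
            self.tm = self.tlm = tuple(float(v) for v in operands)
        elif op == "Tj":                    # show one string
            self._show(operands[0])
        elif op == "TJ":                    # show strings with adjustments
            for item in operands[0]:
                if isinstance(item, str):
                    self._show(item)
                else:                       # thousandths of an em, subtracted
                    shift = -item / 1000.0 * self.size
                    self.tm = concat(translation(shift, 0.0), self.tm)
        # every other operator (including ET) is ignored in this sketch

    def _show(self, text):
        start = (self.tm[4], self.tm[5])
        advance = len(text) * GLYPH_WIDTH_EM * self.size
        self.tm = concat(translation(advance, 0.0), self.tm)
        end = (self.tm[4], self.tm[5])
        self.runs.append({"text": text, "start": start, "end": end,
                          "font": self.font, "size": self.size})


state = TextState()
for op, operands in [("BT", ()), ("Tf", ("F1", 12.0)), ("Td", (72.0, 700.0)),
                     ("Tj", ("Hello",)), ("ET", ())]:
    state.handle(op, *operands)
print(state.runs)
```

Each recorded run keeps the baseline start and end points of the string, so a highlight rect for a match can be approximated from those two points plus the font size. The points would still need to be transformed by the page's current transformation matrix to get view coordinates.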

Font encoding is perhaps the most difficult part, since there are a handful of ways an encoding can be specified, and some of them are proprietary to the font. Mostly you can cheat and fall back on a subset of ANSI encoding, but this WILL break on certain PDFs that use strange fonts.
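The fallback strategy described above might look like the following sketch, again in language-neutral Python rather than Quartz code. `decode_pdf_string`, `to_unicode` and `differences` are hypothetical names: the two dictionaries stand in for per-font code-to-character tables that a real implementation would have to build from the font dictionary. The sketch assumes single-byte character codes, so composite fonts with multi-byte codes are out of scope, and it uses Windows-1252 as a stand-in for the "ANSI" fallback.

```python
def decode_pdf_string(raw, to_unicode=None, differences=None):
    """Decode the raw bytes of a shown string, one byte per character code.

    Lookup order: the font's own code-to-text table if it has one, then any
    per-font overrides, then a plain Windows-1252 fallback (assumption).
    """
    out = []
    for code in raw:
        if to_unicode and code in to_unicode:
            out.append(to_unicode[code])
        elif differences and code in differences:
            out.append(differences[code])
        else:
            # Bytes that Windows-1252 leaves undefined become U+FFFD
            out.append(bytes([code]).decode("cp1252", errors="replace"))
    return "".join(out)


# Standard encoding: the fallback alone is enough
print(decode_pdf_string(b"caf\xe9"))

# A font that remaps codes: without its table, the same bytes decode to garbage
custom_map = {0x01: "H", 0x02: "i"}
print(decode_pdf_string(b"\x01\x02", to_unicode=custom_map))
```

A font that assigns its own codes to glyphs is one plausible reason for extracted text that looks like random letters: the bytes are valid, but they only make sense through that font's table.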

Essentially, you are processing the page as if you were going to render it.
