PDF full text search on iPad with Quartz 2D


Question

Hi guys :) I am trying to implement full text search using Quartz 2D, but it's a nightmare. I can "extract" text from a PDF page using the PDF text operators (TJ and others...)

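    /* Each callback has the signature
       void op_XX(CGPDFScannerRef scanner, void *info). */
    CGPDFOperatorTableRef myTable = CGPDFOperatorTableCreate();

    CGPDFOperatorTableSetCallback(myTable, "BT", &op_BT);  /* begin text object */
    CGPDFOperatorTableSetCallback(myTable, "Td", &op_Td);  /* move text position */
    CGPDFOperatorTableSetCallback(myTable, "TD", &op_TD);  /* move text position, set leading */
    CGPDFOperatorTableSetCallback(myTable, "Tm", &op_Tm);  /* set text matrix */
    CGPDFOperatorTableSetCallback(myTable, "T*", &op_T);   /* move to next line */
    CGPDFOperatorTableSetCallback(myTable, "TJ", &op_TJ);  /* show text with per-glyph spacing */
    CGPDFOperatorTableSetCallback(myTable, "Tf", &op_TF);  /* set font and size */
    CGPDFOperatorTableSetCallback(myTable, "ET", &op_ET);  /* end text object */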

But at the same time I need to highlight each match on the PDF page with a rectangle, the way Safari does, for example. Any suggestions on how to implement this? Are there any solutions that don't require such an immense amount of work?

Answer

This is only the tip of the iceberg...

Detecting the "bytes" encoded in a TJ does not mean that you already have "text", or that you can convert them back to text at all.

In PDF, text is always drawn with an "active" font (Tf). The font has an encoding. There are a lot of different encodings around, and some are not "invertible" in the sense that you can get a Unicode value back from them.

If you have an "invertible" encoding, that's fine. It is still a lot of work to implement the reverse lookup (especially for the multi-byte encodings...) but one fine day you're done.
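To make the reverse lookup concrete, here is a minimal sketch for the easy case of a single-byte encoding. The names `byte_to_unicode` and `decode_single_byte` are made up for illustration, and the table is only a small subset (the values match Windows-1252 for these bytes). A real font may use a different encoding or override entries with a /Differences array, so treat this as a starting point rather than a complete decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: maps one byte of a single-byte encoding to a
   Unicode code point. Only an illustrative subset is covered; anything
   unknown becomes U+FFFD (replacement character). */
uint32_t byte_to_unicode(unsigned char b)
{
    if (b >= 0x20 && b <= 0x7E)
        return b;                 /* printable ASCII maps to itself */
    switch (b) {
    case 0x80: return 0x20AC;     /* euro sign */
    case 0x93: return 0x201C;     /* left double quotation mark */
    case 0x94: return 0x201D;     /* right double quotation mark */
    case 0x96: return 0x2013;     /* en dash */
    default:   return 0xFFFD;     /* not in this sketch's table */
    }
}

/* Decodes len bytes (as delivered by a TJ string) into out, which must
   have room for len code points. Returns the number written. */
size_t decode_single_byte(const unsigned char *bytes, size_t len,
                          uint32_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = byte_to_unicode(bytes[i]);
    return len;
}
```

Multi-byte encodings need the same idea plus a step that first splits the byte string into codes of the right width, which is where most of the extra work goes.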

If your encoding is not so smart, you may still have an additional /ToUnicode map that lets you compute a Unicode value. It takes additional effort, but now you're fine.
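Once a /ToUnicode map has been parsed, the lookup itself can be quite small. The sketch below assumes the parsing is already done and that the entries sit in an array sorted by character code. `ToUnicodeEntry` and `to_unicode_lookup` are hypothetical names. It is also simplified: a real /ToUnicode CMap can contain ranges (bfrange) and can map one code to several Unicode characters, neither of which is handled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed entry: one character code from the content
   stream and the single Unicode code point it stands for. */
typedef struct {
    uint16_t code;
    uint32_t unicode;
} ToUnicodeEntry;

/* Binary search over entries sorted by ascending code.
   Returns U+FFFD when the code has no mapping. */
uint32_t to_unicode_lookup(const ToUnicodeEntry *map, size_t n,
                           uint16_t code)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].code == code)
            return map[mid].unicode;
        if (map[mid].code < code)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0xFFFD;
}
```

Returning U+FFFD for missing codes keeps the search index usable even when a document only maps part of its glyphs.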

...except that many existing documents out there provide neither of these mappings to Unicode...

...and after all: PDF does not contain "text" in that sense, it draws characters. So in theory you must place the characters on a virtual page before you can sort them into any readable order...
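The "virtual page" step can be sketched roughly like this, assuming you have already recorded a position and a Unicode value for every character while scanning. `PlacedGlyph`, `sort_reading_order`, and the line tolerance are all hypothetical. The sketch also assumes plain horizontal left-to-right text with the page origin at the bottom-left, so rotated text, columns, and right-to-left scripts would need more work.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one drawn character on the virtual page. */
typedef struct {
    double x, y;        /* baseline origin in page space */
    uint32_t unicode;   /* decoded code point */
    int line;           /* filled in by sort_reading_order */
} PlacedGlyph;

/* Higher y (nearer the top of the page) sorts first. */
static int by_y_desc(const void *a, const void *b)
{
    const PlacedGlyph *ga = a, *gb = b;
    return (ga->y < gb->y) - (ga->y > gb->y);
}

/* Earlier line first, then left to right within a line. */
static int by_line_then_x(const void *a, const void *b)
{
    const PlacedGlyph *ga = a, *gb = b;
    if (ga->line != gb->line)
        return ga->line < gb->line ? -1 : 1;
    return (ga->x > gb->x) - (ga->x < gb->x);
}

/* Groups glyphs into lines (a new line starts when y drops more than
   line_tolerance below the current line's first glyph), then orders
   them top to bottom, left to right. */
void sort_reading_order(PlacedGlyph *g, size_t n, double line_tolerance)
{
    if (n == 0)
        return;
    qsort(g, n, sizeof *g, by_y_desc);
    int line = 0;
    double line_y = g[0].y;
    for (size_t i = 0; i < n; i++) {
        if (line_y - g[i].y > line_tolerance) {
            line++;
            line_y = g[i].y;
        }
        g[i].line = line;
    }
    qsort(g, n, sizeof *g, by_line_then_x);
}
```

Grouping into lines before sorting by x avoids a tolerance-based comparator, which would not give `qsort` a consistent ordering. With positions kept per character, the same records could also feed the highlight rectangles asked about in the question.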

All in all, it's a lot of fun.
