iOS PDF to plain text parser

Question

I'm quite at a loss on this subject. I've read pretty much every post about it here on SO, and I would very much appreciate it if somebody would nudge me in the right direction.

I have a PDF and I would like to extract its text; I'm only interested in words and spaces. I have set up a CGPDFScanner and its callback methods. From what I have read, I only need to consider four operators as far as extracting text goes: TJ, Tj, quote (') and double quote (").

I guess I also need to keep track of the text space to be able to determine whether the letters should be put together to form a word or should be separated by a space, but I have no idea how to do this.

In the PDF, all text is in the format

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs but frankly I do not find them very easy to read/understand.

I have studied the PDFKitten code which was helpful.

Any help would be greatly appreciated.

Recommended answer

I cannot give you advice on how to extract words from a PDF, but the format of

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

is explained, for example, in the PDF 1.7 Specification, section "9.4.3 Text-Showing Operators". The description of the TJ operator is:

Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space.

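So the numbers are adjustments to the distance between the letters. Each one is given in thousandths of a text space unit and is subtracted from the current position, so a negative value such as -24.2524 pushes the next glyph slightly further along the line (to the right in horizontal writing).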
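To make the arithmetic concrete, here is a minimal, language-neutral sketch in Python rather than CGPDFScanner code. It assumes the TJ array has already been decoded into a list of strings and numbers. The helper names `tj_displacement` and `tj_to_text` and the `space_threshold` value of -200 are hypothetical choices for illustration, not something taken from the PDF specification or from PDFKitten. Treating a large negative adjustment as a word gap is only a common heuristic, and a suitable threshold depends on the font.

```python
from typing import List, Union

# A decoded TJ array: strings are shown, numbers adjust the text position.
TJElement = Union[str, float]


def tj_displacement(adjustment: float, font_size: float,
                    horizontal_scale: float = 1.0) -> float:
    """Shift in text space units caused by one numeric TJ element.

    The number is in thousandths of a unit and is subtracted from the
    current position, so a negative adjustment moves the next glyph right.
    """
    return -(adjustment / 1000.0) * font_size * horizontal_scale


def tj_to_text(elements: List[TJElement],
               space_threshold: float = -200.0) -> str:
    """Join the strings of a TJ array, guessing where word gaps are.

    Assumption: an adjustment at or below `space_threshold` (a hypothetical,
    font-dependent value) is wide enough to count as a space.
    """
    parts: List[str] = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif element <= space_threshold and parts and not parts[-1].endswith(" "):
            parts.append(" ")
    return "".join(parts)


# The array from the question: every gap is only -24.2524 thousandths.
question_array = ["X", -24.2524, "X", -24.2524, "X", -24.2524,
                  "Y", -24.2524, "Y", -24.2524]
print(tj_to_text(question_array))
print(tj_displacement(-24.2524, font_size=12.0))
```

Under these assumptions the adjustments in the question are far too small to be word gaps, so the five glyphs are joined into one run. At a font size of 12, each adjustment moves the next glyph only about 0.29 text space units to the right.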
