Extract text from PDF document based on position c++


Question


I am trying to extract text from a PDF document based on its coordinates, and I have come across two notions in the Adobe PDF Reference (chap. 5.3):

  1. Text positioning operators
  2. Text showing operators

For now I am interested in the Td and Tm positioning operators. With Td you have tx and ty, relative to the start of the current line, and they are clearly specified in the PDF document as tx ty Td. I have used this to extract text by its tx and ty coordinates. The problem is that I don't know how to extract text from a PDF based on its position in general while supplying only tx and ty.

a b c d e f Tm

This is the general form of the Tm operator. What do the values a to f represent? This is my input for Tm:

BT
/F1 8.88 Tf
0 0 0 rg
0.9998 0 0 1 401.52 448.08 Tm
[<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ
ET

Why does each group of four digits have a leading 00? Is this hex? Should I convert it from hex to int and then to the corresponding character?

This is my input for Td:

BT 43.20 421.90 Td 0 Tw /C001 10.00 Tf 0.00 Tw <BlablablaTextInHexThatICanProcess>Tj ET

This is much clearer, and so are the coordinates. How can I extract the text from a Tm-positioned PDF text object based on simple X and Y coordinates? I am using C++ and the PoDoFo library.

Solution

First of all, when trying to extract text from a PDF based on its position, while supplying only tx and ty, it does not suffice to only consider the text matrix (which you set using the Tm operator you already found). You also have to consider the current transformation matrix!

I assume that when you refer to a position, you mean a position given in default user space coordinates.

To avoid the device-dependent effects of specifying objects in device space, PDF defines a device-independent coordinate system that always bears the same relationship to the current page, regardless of the output device on which printing or displaying occurs. This device-independent coordinate system is called user space.

The user space coordinate system shall be initialized to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward

(section 8.3.2.3, ISO 32000-1:2008)

As we only see the x and y coordinates, we think of a position as a vector (x, y) in R². Internally, though, PDF considers this plane embedded in R³ with a constant z coordinate of 1, i.e. [x, y, 1]. This is because PDF wants to allow numerous kinds of transformations (translation, rotation, scaling, skewing, ...) while limiting the required mathematical operations as far as possible. After embedding our plane as [x, y, 1] into R³, all these transformations become possible by means of matrix multiplication:

[x', y', 1] = [x, y, 1] x [[a, b, 0], [c, d, 0], [e, f, 1]]

that is, x' = a·x + c·y + e and y' = b·x + d·y + f.

Here you already see those numbers a, b, c, d, e, and f you asked about.
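To make the role of the six operands concrete, here is a minimal C++ sketch of the row-vector convention described above. The `Matrix` and `Point` types and the function names are illustrative assumptions, not part of PoDoFo or any other library.

```cpp
#include <cassert>

// The six operands of a PDF matrix [[a, b, 0], [c, d, 0], [e, f, 1]].
struct Matrix {
    double a, b, c, d, e, f;
};

struct Point {
    double x, y;
};

// PDF uses row vectors: [x', y', 1] = [x, y, 1] x M.
Point transform(const Point& p, const Matrix& m) {
    return Point{m.a * p.x + m.c * p.y + m.e,
                 m.b * p.x + m.d * p.y + m.f};
}

// Product lhs x rhs: a point is mapped by lhs first, then by rhs.
Matrix multiply(const Matrix& lhs, const Matrix& rhs) {
    return Matrix{lhs.a * rhs.a + lhs.b * rhs.c,
                  lhs.a * rhs.b + lhs.b * rhs.d,
                  lhs.c * rhs.a + lhs.d * rhs.c,
                  lhs.c * rhs.b + lhs.d * rhs.d,
                  lhs.e * rhs.a + lhs.f * rhs.c + rhs.e,
                  lhs.e * rhs.b + lhs.f * rhs.d + rhs.f};
}
```

With the operands from the question (0.9998 0 0 1 401.52 448.08), the origin (0, 0) lands on (401.52, 448.08): a and d scale, b and c rotate or skew, and e and f translate.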

Now, before taking the text specific transformations into account, you have to take into account the manipulations of the current (text independent) transformation matrix. This matrix is manipulated by the cm operators:

a b c d e f cm Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(section 8.4.4, ISO 32000-1:2008)

This implies, BTW, that you have to consider all cm operators currently in action, i.e. all presented since the start of the page content, with the exception of those revoked by restoring a former graphics state (cf. the operators q and Q pushing and restoring graphic states, section 8.4.2, ISO 32000-1:2008).
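The bookkeeping for cm, q and Q could look roughly like the following sketch. It assumes you already have some tokenizer that hands you operators and operands; the `CtmTracker` class and its method names are hypothetical, and a real graphics state holds much more than the CTM.

```cpp
#include <cassert>
#include <vector>

struct Matrix {
    double a, b, c, d, e, f;
};

// Product lhs x rhs in the PDF row-vector convention.
Matrix multiply(const Matrix& lhs, const Matrix& rhs) {
    return Matrix{lhs.a * rhs.a + lhs.b * rhs.c,
                  lhs.a * rhs.b + lhs.b * rhs.d,
                  lhs.c * rhs.a + lhs.d * rhs.c,
                  lhs.c * rhs.b + lhs.d * rhs.d,
                  lhs.e * rhs.a + lhs.f * rhs.c + rhs.e,
                  lhs.e * rhs.b + lhs.f * rhs.d + rhs.f};
}

// Tracks the CTM while walking a page content stream (illustrative only).
class CtmTracker {
public:
    // q: push a copy of the current graphics state (here reduced to the CTM).
    void save() { stack_.push_back(ctm_); }

    // Q: restore the most recently saved state; an unbalanced Q is ignored.
    void restore() {
        if (!stack_.empty()) {
            ctm_ = stack_.back();
            stack_.pop_back();
        }
    }

    // cm: concatenate the operand matrix, CTMnew = M x CTMold.
    void concat(const Matrix& m) { ctm_ = multiply(m, ctm_); }

    const Matrix& ctm() const { return ctm_; }

private:
    Matrix ctm_{1.0, 0.0, 0.0, 1.0, 0.0, 0.0};  // identity at page start
    std::vector<Matrix> stack_;
};
```

Call `save()` for each q, `restore()` for each Q and `concat()` for each cm; whatever `ctm()` returns when you reach a text-showing operator is the matrix that applies to that text.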

Only now can you consider the text-specific transformation matrices:

At the beginning of a text object, Tm shall be the identity matrix; therefore, the origin of text space shall be initially the same as that of user space. The text-positioning operators, described in Table 108, alter Tm and thereby control the placement of glyphs that are subsequently painted. Also, the text-showing operators, described in Table 109, update Tm (by altering its e and f translation components) to take into account the horizontal or vertical displacement of each glyph painted as well as any character or word-spacing parameters in the text state.

Additionally, within a text object, a conforming reader shall keep track of a text line matrix, Tlm, which captures the value of Tm at the beginning of a line of text. The text-positioning and text-showing operators shall read and set Tlm on specific occasions mentioned in Tables 108 and 109

(section 9.4.2, ISO 32000-1:2008)

Thus, inside a text object you have to keep track of the text matrix. It is primarily set using the Tm operator you found, with the operands arranged in the matrix as shown above, but it is also changed as a side effect of other text-positioning and text-showing operators.

And there still are additional parameters determining the final position of the text, the text state parameters Tfs (the text font size), Th (the horizontal scaling), and Trise (the text rise), cf. section 9.3.1, ISO 32000-1:2008.

Conceptually, the entire transformation from text space to device space [or in your case to the default user space] may be represented by a text rendering matrix, Trm:

Trm = [[Tfs x Th, 0, 0], [0, Tfs, 0], [0, Trise, 1]] x Tm x CTM

Trm is a temporary matrix; conceptually, it is recomputed before each glyph is painted during a text-showing operation.

(section 9.4.2, ISO 32000-1:2008)

Thus, your coordinates (x, y) conceptually result from the text space coordinates by multiplication with Trm:

[x, y, 1] = [xts, yts, 1] x Trm
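As a sketch of this step, the following C++ builds Trm from the text state, Tm and the CTM as defined in section 9.4.2 of ISO 32000-1:2008, and reads off where a glyph origin ends up. The struct and function names are assumptions made for illustration, and the horizontal scaling is taken as a factor (the Tz operand divided by 100).

```cpp
#include <cassert>

struct Matrix {
    double a, b, c, d, e, f;
};

struct Point {
    double x, y;
};

// Product lhs x rhs in the PDF row-vector convention.
Matrix multiply(const Matrix& lhs, const Matrix& rhs) {
    return Matrix{lhs.a * rhs.a + lhs.b * rhs.c,
                  lhs.a * rhs.b + lhs.b * rhs.d,
                  lhs.c * rhs.a + lhs.d * rhs.c,
                  lhs.c * rhs.b + lhs.d * rhs.d,
                  lhs.e * rhs.a + lhs.f * rhs.c + rhs.e,
                  lhs.e * rhs.b + lhs.f * rhs.d + rhs.f};
}

// Text state parameters that enter Trm (field names are illustrative).
struct TextState {
    double fontSize;         // Tfs, second operand of Tf
    double horizontalScale;  // Th as a factor, i.e. the Tz operand / 100
    double rise;             // Trise, operand of Ts
};

// Trm = [[Tfs x Th, 0, 0], [0, Tfs, 0], [0, Trise, 1]] x Tm x CTM
Matrix textRenderingMatrix(const TextState& ts, const Matrix& tm,
                           const Matrix& ctm) {
    const Matrix params{ts.fontSize * ts.horizontalScale, 0.0,
                        0.0, ts.fontSize,
                        0.0, ts.rise};
    return multiply(multiply(params, tm), ctm);
}

// The glyph origin is (0, 0) in text space, so [0, 0, 1] x Trm is simply
// the translation part (e, f) of Trm.
Point glyphOrigin(const Matrix& trm) {
    return Point{trm.e, trm.f};
}
```

For the text object in the question (font size 8.88, Tm operands 0.9998 0 0 1 401.52 448.08) and assuming the CTM is still the identity, the first glyph origin is at (401.52, 448.08).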

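where (xts, yts) is (0, 0) at the glyph's origin. For every glyph printed you have a glyph displacement that takes you to the point where the next glyph origin will be positioned (Tc and Tw are the character and word spacing, Tj is a position adjustment from a TJ array):

tx = ((w0 - Tj / 1000) x Tfs + Tc + Tw) x Th
ty = (w1 - Tj / 1000) x Tfs + Tc + Tw

The text matrix shall be updated by these glyph displacement values as follows:

Tm = [[1, 0, 0], [0, 1, 0], [tx, ty, 1]] x Tm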

(section 9.4.4, ISO 32000-1:2008)
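A minimal sketch of this update for horizontal writing mode, following the tx formula from section 9.4.4 of ISO 32000-1:2008. All names are illustrative, and the glyph width used in the usage note is made up, because the real widths live in the font dictionary, which the question does not show.

```cpp
#include <cassert>

struct Matrix {
    double a, b, c, d, e, f;
};

// Text state parameters used for the displacement (field names illustrative).
struct TextState {
    double fontSize;         // Tfs
    double horizontalScale;  // Th as a factor, i.e. the Tz operand / 100
    double charSpacing;      // Tc
    double wordSpacing;      // Tw
};

// tx = ((w0 - Tj / 1000) x Tfs + Tc + Tw) x Th
// w0 is the glyph width in text space (font width / 1000 for non-Type 3 fonts),
// tj is a number from a TJ array (0 for a plain glyph). For a bare TJ number
// this sketch expects w0 == 0 and zero spacing in the state passed in.
// Word spacing is only applied when the caller flags the glyph as a space.
double horizontalDisplacement(double w0, double tj, const TextState& ts,
                              bool isSpace) {
    const double tw = isSpace ? ts.wordSpacing : 0.0;
    return ((w0 - tj / 1000.0) * ts.fontSize + ts.charSpacing + tw) *
           ts.horizontalScale;
}

// Tm = [[1, 0, 0], [0, 1, 0], [tx, ty, 1]] x Tm
// Only the translation part of Tm changes.
void advance(Matrix& tm, double tx, double ty) {
    tm.e += tx * tm.a + ty * tm.c;
    tm.f += tx * tm.b + ty * tm.d;
}
```

With a hypothetical glyph width of 500 font units (w0 = 0.5) at font size 8.88, tx is 4.44, so Tm's e component moves from 401.52 to about 405.96. A -11 entry in the TJ array on its own adds a further 0.011 x 8.88 ≈ 0.098 units.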

I quoted a number of paragraphs from the current PDF specification, ISO 32000-1:2008. I gather this is preferable to using the PDF Reference 1.4, which is quite ancient; furthermore, it has been called "not normative in nature" by Adobe personnel.

EDIT Some clarifications in answer to comments

Device space and user space: what is the distinction between them? Isn't device space referring to the printer or video display, and user space a way of overcoming every device's particularities? Like the user page being the document page that I see?

Yes, the device space is a fixed coordinate system essentially determined by the properties of the device at hand. And yes, the user space is a coordinate system independent of the target device. But no, it is not "the document page you see", because you see it on some device (or after it has been processed by some device).

The user space coordinate system is an independent coordinate system; the coordinates of a point in it can be translated to device coordinates by means of a matrix multiplication with the current transformation matrix (CTM).

UserCoords x CTM = DeviceCoords

The user space coordinate system is initialized, by initializing the CTM accordingly, to a state in which the CropBox entry in the page dictionary specifies the rectangle of user space corresponding to the visible area (see above).

But as the choice of words already indicates ("current transformation matrix", "the coordinate system is initialized"), the user space coordinate system is a dynamic, ever-changing coordinate system.

The default user space provides a consistent, dependable starting place for PDF page descriptions regardless of the output device used. If necessary, a PDF content stream may modify user space to be more suitable to its needs by applying the coordinate transformation operator, cm (see 8.4.4, "Graphics State Operators"). Thus, what may appear to be absolute coordinates in a content stream are not absolute with respect to the current page because they are expressed in a coordinate system that may slide around and shrink or expand. Coordinate system transformation not only enhances device-independence but is a useful tool in its own right.

(section 8.3.2.3, ISO 32000-1:2008)

Thus, when a PDF reader stumbles upon a cm operator whose operands represent some matrix M, the CTM changes:

CTMnew = M x CTMold

and coordinates present in following operators are interpreted according to this new matrix CTMnew:

UserCoords x CTMnew = DeviceCoords

So now the user space coordinate system might be very different from its former state, scaled, rotated, skewed, whatever.

The coordinates you are most likely interested in are those of the coordinate system the user space is initialized as, i.e. the device coordinate system of a virtual device for which the CTM is initialized as the identity matrix.

Where do text space and glyph space start and end?

The coordinates of text are specified in text space. The transformation from text space to user space is defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

The text matrix TM is initialized as the identity matrix at the start of a text object but changes during the execution of text operations, most visibly when you use the Tm operator and implicitly when you use others. It is combined with a matrix TR containing the text-related parameters font size, horizontal scaling, and text rise; for details see the text rendering matrix Trm above. Thus,

DeviceCoords = UserCoords x CTM = TextCoords x TR x TM x CTM

The transformation from glyph space to text space shall be defined by the font matrix. For most types of fonts, this matrix shall be predefined to map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, the font matrix shall be given explicitly in the font dictionary (see 9.6.5, "Type 3 Fonts").

Thus, this transformation depends on the current font. The font matrix FM from the font dictionary would act like this:

DeviceCoords = GlyphCoords x FM x TR x TM x CTM

You do not want to locate the device coordinates of a single segment of a glyph, so these coordinates are probably of no interest to you. The glyph widths, though, are to be interpreted in glyph space. Unless you are dealing with Type 3 fonts, this merely means that you have to divide them by 1000...

And how do the parameters w0 and w1 evolve during glyph painting? Are they initially (0,0)?

w0 and w1 denote the glyph's horizontal and vertical displacements. In horizontal writing mode, w0 is the glyph width transformed to text space (i.e. most often merely divided by 1000) and w1 is 0. For vertical writing mode text, see sections 9.2.4 and 9.7.4.3 in ISO 32000-1:2008.

Does text space have the same origin as the first glyph space? And do they get updated with the calculated (tx,ty)?

Glyph space coordinates are merely multiplied by the font matrix to give text space coordinates, and for all fonts except Type 3 fonts the font matrix merely compresses by a factor of 1000 (see above). So the glyph origin is mapped to the text space origin.

But tx and ty are used to update the text matrix itself. Thus, the text space coordinate system moves for each glyph, and each (non-Type 3) glyph origin maps to the origin of a slightly changed text space coordinate system.
