PDF Specification - Get Font Size in Points


Problem description

I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.

Unless otherwise specified, one unit of user space in a PDF document is 1/72 of an inch (i.e. 1 pt).

The scale provided by the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.

I have the following page content:

1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...

I know that the font size in points of the 2 text operations defined in the sample is 16pt; however, the Tf operator is using a size of 21.33. In order to convert from this font size back to points, I was intending to use the scale (y) of the cm operator, making the point size:

21.33 * 0.75 = 15.9975

However, I could find nothing in the PDF specification supporting this conversion, and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.

Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct scale, or is this just pure coincidence?

The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf

Solution

First of all, your comparison with other text extractors is based on a misunderstanding:

none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.

The "font size" parameter returned by all those libraries is simply the size argument of the Tf instruction, not the effective font size you observe in the final document, which is what you are trying to determine. So your comparison with other libraries does not make sense.


Now, concerning your approach:

In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:

21.33 * 0.75 = 15.9975

Although some libraries use that name, calling the fourth cm parameter "scale (y)" is misleading. For example, for text rotated by 90° that parameter is usually 0, yet the rendered text is usually not reduced to zero height.

Thus, merely using the "scale (y)" parameter does not work; you have to take the whole transformation into account.
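
To illustrate taking the whole transformation into account, here is a minimal sketch in Python (chosen for brevity; it ports directly to C#). The helper names `multiply` and `effective_font_size` are hypothetical and not part of any library. The sketch assumes the usual PDF convention that a matrix `a b c d e f` acts on row vectors and that `cm` premultiplies the current matrix, and it ignores horizontal scaling and text rise, which do not change the height of a vertical vector.

```python
import math

def multiply(m, n):
    """Return m x n for PDF matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def effective_font_size(tf_size, text_matrix, ctm):
    """Length of the text-space vector (0, tf_size) after applying Tm and then the CTM."""
    _, _, c, d, _, _ = multiply(text_matrix, ctm)
    # A vector is unaffected by translation, so only c and d matter here.
    return math.hypot(c * tf_size, d * tf_size)

# CTM of the sample stream: "1 0 0 -1 0 792 cm" followed by ".75 0 0 .75 0 0 cm".
# Each cm premultiplies the matrix that is already in effect.
sample_ctm = multiply((0.75, 0, 0, 0.75, 0, 0), (1, 0, 0, -1, 0, 792))
sample_tm = (1, 0, 0, -1, 0, 140)
print(effective_font_size(21.33, sample_tm, sample_ctm))  # approximately 15.9975

# A CTM that scales by 0.75 and rotates by 90 degrees: the fourth parameter is 0,
# but the effective size is unchanged.
identity = (1, 0, 0, 1, 0, 0)
rotated_ctm = (0, 0.75, -0.75, 0, 0, 0)
print(rotated_ctm[3] * 21.33)                               # 0.0
print(effective_font_size(21.33, identity, rotated_ctm))    # approximately 15.9975
```

For the sample page the result agrees with 21.33 * 0.75 only because neither matrix rotates or shears; the rotated case shows why the fourth parameter alone is not reliable.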


Finally, let's discuss what you are actually after.

As long as the combined transformation matrix (current transformation matrix + text matrix + horizontal scaling) is orthogonal, i.e. it keeps the text axes at right angles, and text lines follow this orthogonality, the meaning of your notion of font size is fairly obvious.

But as soon as there is shearing in that combined matrix, the meaning of "font size" is no longer obvious.

  • You might mean the length of what an originally vertical line (one unit high) is transformed into.
  • You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
  • Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.

The first two numbers are trivial to calculate using simple linear algebra. The third may be more difficult because you have to determine the baseline observed by humans in the resulting PDF. With creative use of transformations, this might be non-trivial.
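
The first two measures from the list above can be sketched as follows, again in Python. The function name `size_measures` is hypothetical, and the sketch assumes `m` is the already combined matrix `(a, b, c, d, e, f)` in the PDF row-vector convention with a non-degenerate baseline direction (a and b not both zero). The third measure is not attempted, since it depends on the baseline a human would perceive.

```python
import math

def size_measures(tf_size, m):
    """Return (transformed length, length perpendicular to the transformed baseline)."""
    a, b, c, d, _, _ = m
    # Image of the vertical vector (0, tf_size) and of the baseline direction (1, 0).
    vx, vy = c * tf_size, d * tf_size
    bx, by = a, b
    # Measure 1: length of the transformed vertical line.
    length = math.hypot(vx, vy)
    # Measure 2: component of that line at a right angle to the transformed
    # baseline, computed as |cross(baseline, v)| / |baseline|.
    perpendicular = abs(bx * vy - by * vx) / math.hypot(bx, by)
    return length, perpendicular

# Without shear, both measures agree (about 15.9975 each for the sample values).
print(size_measures(21.33, (0.75, 0, 0, 0.75, 0, 0)))

# With a horizontal shear (x' = x + 0.5 y) the two measures differ:
# the slanted stroke is longer than its height above the baseline.
print(size_measures(10, (1, 0, 0.5, 1, 0, 0)))
```

Which of the two numbers to report as "font size" is a design decision for the parser; the second usually matches what a reader perceives for slanted (sheared) text.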
