Get all text operators whose color is black, pdfBox

Question

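While parsing an existing PDF, I am using if(op.getOperation().equals("TJ")) to get text operators. What I want to do is target only the ones whose color is black (or some other specifiable color). I could not find a method for this in the pdfBox docs.

Basically, what I want to do is keep only black text in the PDF and remove any other text operator that doesn't match the criteria.

Can anyone share a solution?

Thanks!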

Solution

Text-showing operators

> While parsing an existing PDF, I am using if(op.getOperation().equals("TJ")) to get text operators,

In general, there are more text-showing operators you have to take care of:

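> string Tj: Show a text string.
>
> string ': Move to the next line and show a text string. This operator shall have the same effect as the code T* string Tj
>
> aw ac string ": Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units. This operator shall have the same effect as this code: aw Tw ac Tc string '
>
> array TJ: Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount.

(Table 109 in the PDF specification ISO 32000-1)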
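As a minimal sketch, the check from the question could be widened to all four operators in the table above. The class and method names here are hypothetical helpers, not part of pdfBox; the string passed in is assumed to be whatever op.getOperation() returns in your parsing loop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Recognizes the four text-showing operators listed in Table 109 (hypothetical helper). */
final class TextShowingOperators {
    // Operator names exactly as they appear in a content stream.
    private static final Set<String> NAMES =
            new HashSet<>(Arrays.asList("Tj", "'", "\"", "TJ"));

    private TextShowingOperators() {}

    /** Returns true if the given operator name is one of Tj, ', ", or TJ. */
    static boolean isTextShowing(String operatorName) {
        return operatorName != null && NAMES.contains(operatorName);
    }

    public static void main(String[] args) {
        // Td and rg are included only to show that other operators are rejected.
        for (String name : new String[] {"Tj", "'", "\"", "TJ", "Td", "rg"}) {
            System.out.println(name + " -> " + isTextShowing(name));
        }
    }
}
```

In the loop from the question, the condition would then read something like if (TextShowingOperators.isTextShowing(op.getOperation())).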

Text color

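The color used to show text depends on the current text rendering mode.

> The text rendering mode, Tmode, determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three.

(Section 9.3.6 in the PDF specification ISO 32000-1)

It is set using the Tr operator:

> render Tr: Set the text rendering mode, Tmode, to render, which shall be an integer. Initial value: 0.

(Table 105 in the PDF specification ISO 32000-1)

Depending on this mode, you have to consider the current stroke color, the current fill color, the color of whatever is later painted in the defined clipping boundary, or some combination of the three.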
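To make that dependency concrete, here is a small sketch that maps a Tr value to the colors that matter. The mode numbers 0 to 7 follow the text rendering mode table in ISO 32000-1 (Table 106); the class and method names are hypothetical.

```java
/** Which painting steps a text rendering mode (operand of Tr) implies; hypothetical helper. */
final class TextRenderingModes {
    private TextRenderingModes() {}

    /** Modes 0, 2, 4, 6 fill the glyphs, so the current fill color matters. */
    static boolean fills(int mode) {
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** Modes 1, 2, 5, 6 stroke the glyph outlines, so the current stroke color matters. */
    static boolean strokes(int mode) {
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    /** Modes 4 to 7 add the glyphs to the clipping path. */
    static boolean clips(int mode) {
        return mode >= 4 && mode <= 7;
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println("Tr " + mode
                    + ": fill=" + fills(mode)
                    + ", stroke=" + strokes(mode)
                    + ", clip=" + clips(mode));
        }
    }
}
```

Mode 3 neither fills nor strokes (invisible text), and mode 7 only clips, so in those two cases neither the fill nor the stroke color decides what the reader sees.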

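The color-setting operators are defined in Table 74 of the specification ISO 32000-1.

Most often the glyph outlines are merely filled (mode 0). Thus, most often you have to consider the current fill color. That still leaves quite a lot of color-setting commands to consider.

Most often gray, RGB, or CMYK colors are used here. Thus, most often you will have to check the g, rg, or k operators.

Pure black is set by 0 g, 0 0 0 rg, or 0 0 0 1 k. You might also want to consider values that are very near to those; they might have been intended as black and differ only due to rounding issues.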
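A rough sketch of such a "black or nearly black" test for the three common fill color operators follows. The tolerance of 0.01 is an arbitrary assumption, and the class and method names are hypothetical. For k, only colors near 0 0 0 1 are matched, so other dark CMYK mixes are deliberately not treated as black here.

```java
/** Tolerant test for black fill colors set by g, rg, or k (hypothetical helper). */
final class BlackFillCheck {
    /** Allowed deviation from pure black; 0.01 is an arbitrary illustrative choice. */
    static final double EPS = 0.01;

    private BlackFillCheck() {}

    /**
     * @param operator the fill color operator name: "g", "rg", or "k"
     * @param c        its numeric operands, in content stream order
     * @return true if the operands describe (nearly) pure black
     */
    static boolean isBlack(String operator, double... c) {
        switch (operator) {
            case "g":  // gray level 0 is black
                return c.length == 1 && c[0] <= EPS;
            case "rg": // red, green, and blue all 0 is black
                return c.length == 3 && c[0] <= EPS && c[1] <= EPS && c[2] <= EPS;
            case "k":  // cyan, magenta, yellow 0 and black 1
                return c.length == 4 && c[0] <= EPS && c[1] <= EPS && c[2] <= EPS
                        && c[3] >= 1 - EPS;
            default:   // any other operator is not handled by this sketch
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isBlack("g", 0));
        System.out.println(isBlack("rg", 0.004, 0, 0.002));
        System.out.println(isBlack("k", 0, 0, 0, 1));
        System.out.println(isBlack("rg", 1, 0, 0));
    }
}
```

The operands would have to be collected from the tokens that precede the operator in the content stream; how you obtain them depends on the parser you use.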

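Color transformations

To make things a bit more complex: the colors mentioned above may still be transformed into some completely different color, e.g. by means of transfer functions (cf. section 10.4), transparency, or blending (cf. section 11).

If you also want to consider these effects, you are essentially programming your own PDF renderer.

Normally, though, PDFs intended mainly for text on the web don't use these features. Thus, for your purposes I would not consider them at first.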
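Leaving those transformations aside, the overall filtering could be sketched roughly as below. This is an illustration under several assumptions, not a pdfBox recipe: Op is a hypothetical stand-in for one parsed instruction, only the fill color is tracked (so rendering mode 0 is assumed throughout), the 0.01 tolerance is arbitrary, and any fill color set through cs, sc, or scn is conservatively treated as not black. The q and Q operators are handled because the fill color is part of the graphics state that they save and restore.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Sketch: drop text-showing operations whose current fill color is not black. */
final class BlackTextFilter {

    /** Hypothetical stand-in for one parsed instruction; not a pdfBox type. */
    static final class Op {
        final String name;      // operator name, e.g. "rg" or "Tj"
        final double[] numbers; // numeric operands only; string operands are left out of this model

        Op(String name, double... numbers) {
            this.name = name;
            this.numbers = numbers;
        }
    }

    static final double EPS = 0.01; // arbitrary rounding tolerance (assumption)
    static final List<String> TEXT_SHOWING = Arrays.asList("Tj", "'", "\"", "TJ");
    static final List<String> FILL_COLOR_OPS = Arrays.asList("g", "rg", "k", "cs", "sc", "scn");

    /** True only for g, rg, and k operands that are (nearly) pure black. */
    static boolean isBlack(Op op) {
        double[] c = op.numbers;
        switch (op.name) {
            case "g":  return c.length == 1 && c[0] <= EPS;
            case "rg": return c.length == 3 && c[0] <= EPS && c[1] <= EPS && c[2] <= EPS;
            case "k":  return c.length == 4 && c[0] <= EPS && c[1] <= EPS && c[2] <= EPS
                               && c[3] >= 1 - EPS;
            default:   return false; // cs, sc, scn: color unknown here, treated as not black
        }
    }

    /** Returns the instructions with non-black text-showing operations removed. */
    static List<Op> keepBlackText(List<Op> ops) {
        List<Op> kept = new ArrayList<>();
        Deque<Boolean> savedStates = new ArrayDeque<>();
        boolean fillIsBlack = true; // the initial fill color is black
        for (Op op : ops) {
            if (op.name.equals("q")) {
                savedStates.push(fillIsBlack);            // save graphics state
            } else if (op.name.equals("Q")) {
                if (!savedStates.isEmpty()) {
                    fillIsBlack = savedStates.pop();      // restore graphics state
                }
            } else if (FILL_COLOR_OPS.contains(op.name)) {
                fillIsBlack = isBlack(op);
            }
            if (TEXT_SHOWING.contains(op.name) && !fillIsBlack) {
                continue; // drop the operator together with its operands
            }
            kept.add(op);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Op> ops = Arrays.asList(
                new Op("Tj"), new Op("rg", 1, 0, 0), new Op("Tj"),
                new Op("g", 0), new Op("TJ"));
        for (Op op : keepBlackText(ops)) {
            System.out.println(op.name);
        }
    }
}
```

Note that dropping a ' or " operation also drops its implicit move to the next line (and, for ", the spacing it sets), so text that follows may shift. A more careful version might replace such an operation with the equivalent positioning operators instead of deleting it outright.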
