Again having invisible text coming from PdfTextStripper


Problem description

File example: file.

Problem: when extracting text using PdfTextStripper, the tokens "9/1/2017" and "387986" appear after "ASSETS" at the start of the page. They should be removed, along with some other hidden tokens.

I have already applied this solution (so I do not copy-paste it here, because the problem is actually exactly the same) and the hidden text still appears on the page. Could it be hidden by something other than a clip path? Thanks!

Solution

Could it be hidden by something else except clip path?

Yes. In the case of your new document, the text is written in white on white, e.g. the 387986 after ASSETS is drawn like this:

1 1 1 rg
/TT0 16 Tf
-1011.938 115.993 Td
(@A,BAC)Tj 

The initial 1 1 1 rg sets the fill color to RGB WHITE. (Additionally that text is quite tiny but would still be visible if drawn in e.g. BLACK.)

The solution you refer to was implemented for documents like the sample document presented in that issue, in which the text is made invisible by defining clip paths (outside the bounds of which the text lies) and by filling paths (hiding the text underneath). Thus, it won't recognize your white text as hidden.

Unfortunately, the invisibility of WHITE on WHITE text is harder to determine than that of clipped or covered text: one not only needs to know a property of the current graphics state (like the clip path) or remove all text inside a given path, one also needs to know the color of that part of the page right before the text is drawn (to check the on WHITE detail).

If, on the other hand, you assume the page background to be essentially WHITE, it is fairly simple to ignore all white text: Simply also detect the current fill color in processTextPosition:

PDColor fillColor = gs.getNonStrokingColor();

and compare it to the flavors of WHITE you want to consider invisible. (Usually it should suffice to compare with RGB, CMYK, and Grayscale WHITE; in rare cases you'll also have to correctly interpret more complex color spaces. Additionally, you might also consider nearly WHITE colors invisible: (.99, .99, .99) RGB can hardly be distinguished from WHITE.)

If you find the current color to be WHITE, ignore the current TextPosition.
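
The comparison itself can be sketched without any PDF library. The following is a minimal illustration, not part of PDFBox: the class WhiteColorCheck, its method names, and the tolerance parameter are all hypothetical. It assumes you pass in the color space name and the component values, which in PDFBox 2.x should be obtainable from the PDColor via getColorSpace().getName() and getComponents(), and it assumes a white page background.

```java
/** Hypothetical helper (not a PDFBox class): decides whether a color counts as white. */
final class WhiteColorCheck {

    private WhiteColorCheck() {}

    /**
     * @param colorSpaceName PDF color space name, e.g. "DeviceRGB"
     * @param components     color components, each in the range 0..1
     * @param tolerance      allowed deviation from pure white, e.g. 0.02f
     * @return true if the color is white or nearly white in a device color space
     */
    static boolean isWhite(String colorSpaceName, float[] components, float tolerance) {
        if (colorSpaceName == null || components == null) {
            return false;
        }
        switch (colorSpaceName) {
            case "DeviceGray": // 1 means white
                return components.length == 1 && allAtLeast(components, 1f - tolerance);
            case "DeviceRGB": // (1, 1, 1) means white
                return components.length == 3 && allAtLeast(components, 1f - tolerance);
            case "DeviceCMYK": // (0, 0, 0, 0) means no ink, i.e. white paper
                return components.length == 4 && allAtMost(components, tolerance);
            default:
                // More complex color spaces are not interpreted here;
                // treating them as visible keeps their text in the output.
                return false;
        }
    }

    private static boolean allAtLeast(float[] values, float limit) {
        for (float v : values) {
            if (v < limit) {
                return false;
            }
        }
        return true;
    }

    private static boolean allAtMost(float[] values, float limit) {
        for (float v : values) {
            if (v > limit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The fill color set by "1 1 1 rg" in the content stream above.
        System.out.println(isWhite("DeviceRGB", new float[] {1f, 1f, 1f}, 0.02f));
        // Plain black text.
        System.out.println(isWhite("DeviceRGB", new float[] {0f, 0f, 0f}, 0.02f));
    }
}
```

Inside an overridden processTextPosition you would then skip the TextPosition (simply not pass it on to the superclass) whenever this check returns true for the current fill color. Returning false for unknown color spaces is a deliberately conservative choice: it may let some hidden text through, but it never drops visible text.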

Be aware, though, that just like the solution you referenced, this is not yet the final solution recognizing all WHITE text. For that you'll also have to check the text rendering mode. If it is just filling (the default), the above holds. If it is (also) stroking, you'll (also) have to consider the stroking color. If the text is rendered invisible, there is no color to consider. And if the text rendering mode includes adding to the path for clipping, you'll have to wait and determine what is drawn later in this part of the page for as long as the clip path holds, which is definitely not trivial!
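
The rendering mode cases above can be written down as a small decision function. This is only a sketch under the same white-background assumption. The class TextVisibilityCheck and its method are hypothetical names, and the integer codes are the text rendering modes 0 to 7 defined by the PDF specification. In PDFBox 2.x the current mode should be available from the text state of the graphics state; that wiring is not shown here.

```java
/** Hypothetical helper (not a PDFBox class): applies the rendering mode rules. */
final class TextVisibilityCheck {

    private TextVisibilityCheck() {}

    /**
     * @param renderingMode PDF text rendering mode, 0..7
     * @param fillIsWhite   whether the non-stroking (fill) color counts as white
     * @param strokeIsWhite whether the stroking color counts as white
     * @return true if the text can be considered invisible on a white page
     */
    static boolean isInvisibleOnWhite(int renderingMode, boolean fillIsWhite, boolean strokeIsWhite) {
        switch (renderingMode) {
            case 0: // fill only (the default)
                return fillIsWhite;
            case 1: // stroke only
                return strokeIsWhite;
            case 2: // fill, then stroke
                return fillIsWhite && strokeIsWhite;
            case 3: // neither fill nor stroke: invisible regardless of color
                return true;
            case 4: // fill and add to clip path
            case 5: // stroke and add to clip path
            case 6: // fill, stroke and add to clip path
            case 7: // add to clip path only
                // Visibility depends on what is drawn later inside the clip path.
                // This sketch does not track that and keeps the text.
                return false;
            default:
                throw new IllegalArgumentException("Unknown text rendering mode: " + renderingMode);
        }
    }

    public static void main(String[] args) {
        // White fill, default rendering mode: the case from the question.
        System.out.println(isInvisibleOnWhite(0, true, false));
        // White fill but black outline: still visible.
        System.out.println(isInvisibleOnWhite(2, true, false));
    }
}
```

Keeping the text in the clipping modes is a simplification, not a solution to that case. Deciding it properly would require inspecting later drawing operations, as described above.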
