Traverse whole PDF and change blue color to black and remove underlines as well (but only from text which contains "http//" & "https//") + iText

Question

I want to change the color of text from blue to black and also remove the underline, but only for text which contains "http//" or "https//".

Reference links:

Traverse whole PDF and change blue color to black ( Change color of underlines as well) + iText

Traverse whole PDF and Remove underlines of hyperlinks (annotations) only + iText

Solution

Presenting the complete code of a solution for this task would be beyond the scope of a Stack Overflow answer. Thus, I'll merely outline one approach to implementing a solution here.

Hindrances

The task is more difficult than one might be aware of.

In particular, the text of a link is not necessarily drawn using a few consecutive text showing operations (let alone a single one). In the worst case each letter of the link could be drawn in a separate instruction, with all these instructions spread in random order all over the content stream and operations drawing non-link content in between.

Thus, you cannot look at each content stream instruction by itself and decide immediately what to do with it, as was possible in the previous approaches you referenced in your question. Instead you'll have to collect all text and line drawing instructions with their context, sort them in the order in which they appear on the page, find URL texts and nearby lines therein, manipulate the underlying instructions, and then write out the page content.

Furthermore, the recognition of "blue" in the referenced answers will not yet catch every shade of blue; only RGB color space blues are considered there, but a blue tint might be generated by other color spaces, too. Also, the text may initially be drawn in a different color and have its color changed by some overlay. Moreover, these color spaces need not necessarily contain a black tint. Thus, the manipulation of the underlying instructions for a general solution is more difficult than simply changing the color value before the recognized link text pieces and lines.
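To illustrate why a pure RGB check falls short, here is a minimal sketch of a blueness test that first maps gray and CMYK colors to RGB. BlueCheck, the naive conversion formulas and the 0.2 margin are all assumptions made for this illustration; they are not part of iText, and ICC-based, Separation or Indexed color spaces would need real color management.

```java
// Hypothetical helper, not part of iText: classifies a color as "blue"
// after a naive conversion to RGB.
final class BlueCheck {
    private BlueCheck() {}

    /** Naive DeviceGray to RGB mapping. */
    static float[] grayToRgb(float gray) {
        return new float[] {gray, gray, gray};
    }

    /** Naive DeviceCMYK to RGB mapping (no ICC profile, an approximation only). */
    static float[] cmykToRgb(float c, float m, float y, float k) {
        return new float[] {(1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)};
    }

    /** Assumed heuristic: the blue channel clearly dominates red and green. */
    static boolean isBlue(float[] rgb) {
        float margin = 0.2f; // arbitrary threshold, tune it for your documents
        return rgb[2] - rgb[0] > margin && rgb[2] - rgb[1] > margin;
    }
}
```

With this sketch, BlueCheck.isBlue(BlueCheck.cmykToRgb(1f, 1f, 0f, 0f)) reports a CMYK blue that an RGB-only test would miss.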

An implementation approach

A solution taking those hindrances into account can still be built based on the PdfCanvasEditor used in the referenced answers (this and this) borrowed from this answer. In contrast to the solutions there, though, the instructions must be collected in the write method together with some relevant information about the state at the time of their execution, in particular the text and text position for text drawing instructions, the line position for line drawing instructions, and the color.

The iText LocationTextExtractionStrategy already does that, merely without keeping the original instructions in mind. Thus, you can borrow code from that strategy or even integrate it (instead of the dummy render listener used by default in the PdfCanvasEditor) and merely have to reference the corresponding instructions from the text chunks processed by the strategy class.

When all the instructions of the page have been collected with that extra information, you have to sort the text. The LocationTextExtractionStrategy also contains code to sort the text chunks accordingly, which you can now use for your task.

In those sorted text chunks you can now look for link texts. Having found them, you can visit all the text drawing instructions associated with those chunks and all the line drawing instructions with positions right under those chunks, check their color for blueness, and (if blue) envelop them in a bracket of "change to black color" and "change back to previous color again" instructions.
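The collecting step could look roughly like the following plain Java sketch, which deliberately does not use iText. InstructionCollector and Collected are hypothetical names; the sketch tracks only the fill color set by the g and rg operators plus q/Q nesting, whereas a real editor would also have to record text and line positions (which the parser's render information provides) and handle further color operators.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the collecting part of an editor's write method.
final class InstructionCollector {

    /** One instruction plus the fill color that was current when it ran. */
    static final class Collected {
        final int index;             // position in the original content stream
        final String operator;       // e.g. "Tj", "re", "rg"
        final List<String> operands; // raw operands, kept so the instruction can be re-emitted
        final float[] fillRgb;       // snapshot of the fill color

        Collected(int index, String operator, List<String> operands, float[] fillRgb) {
            this.index = index;
            this.operator = operator;
            this.operands = operands;
            this.fillRgb = fillRgb;
        }
    }

    private final List<Collected> collected = new ArrayList<>();
    private final Deque<float[]> saved = new ArrayDeque<>();
    private float[] fillRgb = {0f, 0f, 0f}; // the initial fill color is black

    /** Records the instruction; only q, Q, g and rg change the tracked state here. */
    void write(String operator, String... operands) {
        switch (operator) {
            case "q":
                saved.push(fillRgb.clone());
                break;
            case "Q":
                if (!saved.isEmpty()) fillRgb = saved.pop();
                break;
            case "g": {
                float gray = Float.parseFloat(operands[0]);
                fillRgb = new float[] {gray, gray, gray};
                break;
            }
            case "rg":
                fillRgb = new float[] {Float.parseFloat(operands[0]),
                        Float.parseFloat(operands[1]), Float.parseFloat(operands[2])};
                break;
            default:
                break;
        }
        collected.add(new Collected(collected.size(), operator,
                Arrays.asList(operands), fillRgb.clone()));
    }

    List<Collected> instructions() {
        return collected;
    }
}
```

Keeping the stream index next to each snapshot is what later allows a sorted text chunk to point back at the exact instruction that has to be changed.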
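As a rough idea of what sorting in page order means, the following sketch orders chunks by line first (top of the page first) and by horizontal position within a line. ChunkSorter and Chunk are hypothetical names, the sketch assumes unrotated horizontal text, and rounding the baseline to whole units is a simplification; the actual strategy class is more careful, for example about text orientation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of page-order sorting for horizontal, unrotated text.
final class ChunkSorter {
    private ChunkSorter() {}

    static final class Chunk {
        final String text;
        final float x; // start of the chunk on the baseline
        final float y; // baseline height, larger values are higher on the page

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Lines from top to bottom, chunks within a line from left to right. */
    static final Comparator<Chunk> PAGE_ORDER = new Comparator<Chunk>() {
        @Override
        public int compare(Chunk a, Chunk b) {
            // Rounding groups slightly different baselines into one line (assumption).
            int byLine = Integer.compare(Math.round(b.y), Math.round(a.y));
            return byLine != 0 ? byLine : Float.compare(a.x, b.x);
        }
    };

    /** Concatenates the chunk texts in page order without touching the input list. */
    static String textInPageOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(PAGE_ORDER);
        StringBuilder text = new StringBuilder();
        for (Chunk chunk : sorted) {
            text.append(chunk.text);
        }
        return text.toString();
    }
}
```

After sorting, a URL whose pieces were scattered over the content stream becomes contiguous again, which is what makes the later search possible.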
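The search and the color bracket can be sketched as follows. LinkRecolorSketch, the URL pattern, the 3 unit underline gap and the method names are assumptions made for this illustration. The bracket restores the previous color explicitly instead of using q and Q, because those operators are, as far as I know, not permitted inside a text object; stroked lines would need the stroking counterparts (0 G and the previous stroke color) instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: find link chunks, test for underlines, build the color bracket.
final class LinkRecolorSketch {
    private LinkRecolorSketch() {}

    // Assumed pattern; the question's "http//" presumably stands for "http://".
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    /** Returns the indices of the (already sorted) chunks that overlap a URL match. */
    static List<Integer> linkChunkIndices(List<String> sortedChunkTexts) {
        StringBuilder page = new StringBuilder();
        int[] starts = new int[sortedChunkTexts.size()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = page.length();
            page.append(sortedChunkTexts.get(i));
        }
        List<Integer> hits = new ArrayList<>();
        Matcher matcher = URL.matcher(page);
        while (matcher.find()) {
            for (int i = 0; i < starts.length; i++) {
                int end = starts[i] + sortedChunkTexts.get(i).length();
                boolean overlaps = starts[i] < matcher.end() && end > matcher.start();
                if (overlaps && !hits.contains(i)) {
                    hits.add(i);
                }
            }
        }
        return hits;
    }

    /** Assumed heuristic: the line runs just below the baseline within the chunk's x range. */
    static boolean isUnderlineOf(float chunkX, float baselineY, float chunkWidth,
                                 float lineX1, float lineX2, float lineY) {
        float maxGap = 3f; // arbitrary, in user space units
        boolean justBelow = lineY <= baselineY && baselineY - lineY <= maxGap;
        boolean sameSpan = lineX1 < chunkX + chunkWidth && lineX2 > chunkX;
        return justBelow && sameSpan;
    }

    /** Wraps one instruction in "set black" and "restore previous fill color". */
    static List<String> bracketInBlack(String instruction, String previousFillColor) {
        List<String> out = new ArrayList<>();
        out.add("0 g");             // DeviceGray black as fill color
        out.add(instruction);       // the original text or fill instruction
        out.add(previousFillColor); // e.g. "0 0 1 rg", taken from the collected state
        return out;
    }
}
```

Since the question asks for underlines to be removed rather than recolored, the line instructions identified by a check like isUnderlineOf could simply be dropped instead of bracketed.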

To also recognize wilder ways of creating blue text, you have to improve your analysis of the instructions even more. For example, if an area including some text is later filled in blue using the blend mode Lighten, originally black-on-white text suddenly becomes blue-on-white.
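The Lighten case can be checked with a little arithmetic: that blend mode keeps, per color component, the lighter of backdrop and source. The sketch below assumes RGB components in the range 0 to 1 and a fully opaque fill; LightenBlend is a hypothetical name used only for this illustration.

```java
// Hypothetical illustration of the Lighten blend mode for opaque RGB colors.
final class LightenBlend {
    private LightenBlend() {}

    /** Per component, the result is the lighter of backdrop and source. */
    static float[] lighten(float[] backdrop, float[] source) {
        float[] result = new float[backdrop.length];
        for (int i = 0; i < backdrop.length; i++) {
            result[i] = Math.max(backdrop[i], source[i]);
        }
        return result;
    }
}
```

Filling with blue (0, 0, 1) turns the black text pixels (0, 0, 0) into (0, 0, 1), while the white background (1, 1, 1) stays white, so no blue color instruction ever precedes the text drawing instructions themselves.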

A possible generalization

This approach actually would give rise to a more generic PDF text manipulator if you somehow exposed the sorted text chunks and created a more flexible interface with methods for a number of changes to apply to the underlying instructions.

As a solid implementation of the approach above will take quite a number of weeks anyway, you may want to consider such a more generic architecture for possible later re-use and sharing.
