Remove other operators and keep only text operators (TJ, Tj) with PDFBox

Question

I have a PDF from which I want to remove all images and other drawing content, and then save the result as a new PDF.

I know how to remove text by filtering on the TJ and Tj operators, which I currently do as follows:

op.getOperation().equals( "TJ")

Instead of removing the TJ and Tj operators, is it possible to copy these text operators into another PDF file with the formatting intact, so that the new PDF is a pure text-only PDF? It is OK if text drawn with operators other than Tj and TJ is lost.

The code to remove TJ and Tj is taken from this Stack Overflow post. But it only partially works: it removes images only, leaving drawings and other artwork intact.

EDIT: Another option I can think of is to set the CMYK color of all operators outside the BT ... ET blocks to white. That way the PDF would look text-only. Is this possible? If so, please provide code examples in PDFBox.

Recommended answer

... this Stack Overflow post. But it only partially works: it removes images only, leaving drawings and other artwork intact.

Apart from bitmap graphics, the main source of graphics is vector graphics. These usually consist of path definitions followed by commands that fill or stroke the path.

To remove these graphics, you can improve the sample from the answer you referred to by additionally replacing the path stroking and filling operators with the n operator, which is a path-painting no-op.

            if( token instanceof PDFOperator )
            {
                PDFOperator op = (PDFOperator)token;
                if( op.getOperation().equals( "Do") )
                {
                    // drop the single name operand already copied for this operator,
                    // then skip the operator itself
                    newTokens.remove( newTokens.size() -1 );
                    continue;
                }
                else if (PAINTING_PATH_OPS.contains(op.getOperation()))
                {
                    // replace the path painting operator by the path no-op;
                    // its operands (if any) stay in place, as the path is still defined
                    token = PDFOperator.getOperator("n");
                }
            }

where

final static List<String> PAINTING_PATH_OPS = Arrays.asList("S", "s", "F", "f", "f*", "B", "b", "B*", "b*");

contains the path stroking and filling operators.
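The fragment above sits inside the token-copying loop of the referred-to sample, so it cannot be run on its own. As a rough illustration of the same logic, here is a minimal, library-free sketch. The classes `Op` and `VectorGraphicsFilter` are hypothetical stand-ins invented for this example (they are not PDFBox types): an operator is modelled as an `Op`, and every other token counts as an operand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a content stream operator (not a PDFBox class).
final class Op {
    final String name;
    Op(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof Op && ((Op) o).name.equals(name);
    }
    @Override public int hashCode() { return name.hashCode(); }
    @Override public String toString() { return name; }
}

final class VectorGraphicsFilter {
    // The path stroking and filling operators listed in the answer.
    static final List<String> PAINTING_PATH_OPS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "b", "B*", "b*");

    // Copies tokens, dropping "Do" invocations and neutralising path painting.
    static List<Object> filter(List<Object> tokens) {
        List<Object> newTokens = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Op) {
                String name = ((Op) token).name;
                if (name.equals("Do")) {
                    // Drop the single name operand copied just before, then
                    // skip the operator itself.
                    if (!newTokens.isEmpty()) {
                        newTokens.remove(newTokens.size() - 1);
                    }
                    continue;
                }
                if (PAINTING_PATH_OPS.contains(name)) {
                    // End the path without painting it.
                    token = new Op("n");
                }
            }
            newTokens.add(token);
        }
        return newTokens;
    }
}
```

Replacing the painting operator with n, rather than deleting it, keeps every path properly terminated, so a clipping operator (W or W*) that precedes it still takes effect on the remaining content.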

PS: The image removal code used in the referred-to answer has two drawbacks:

  • It removes too much, because it removes not only image xobjects but also form xobjects; sometimes (especially in the output of n-up tools) all content, including all text, resides inside such form xobjects.

To fix this, you have to check the type of the referred-to xobject and remove it only if it has the subtype Image. As form xobjects can in turn contain images, you also have to recurse into each form xobject (which has a content stream of its own).

  • It removes too little, because it ignores inline images.

To fix this, you also have to look out for BI <key-value pairs> ID <image data> EI sections in the content stream and remove them.
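The two fixes can be sketched together in a minimal, library-free model. Everything here is an assumption made for illustration: `OpToken`, `XObj` and `ImageStripper` are hypothetical classes, not PDFBox types, resources are a plain map from name to xobject, and an inline image is assumed to arrive as separate BI, ID and EI operator tokens with its data in between. A real parser may deliver inline images differently (for example, folded into a single token), so check how your PDFBox version tokenizes them before adapting this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a content stream operator (not a PDFBox class).
final class OpToken {
    final String name;
    OpToken(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof OpToken && ((OpToken) o).name.equals(name);
    }
    @Override public int hashCode() { return name.hashCode(); }
    @Override public String toString() { return name; }
}

// Hypothetical stand-in for an xobject with subtype "Image" or "Form".
final class XObj {
    final String subtype;
    List<Object> content;              // a form's own content stream tokens
    final Map<String, XObj> resources; // xobjects the form itself refers to
    XObj(String subtype, List<Object> content, Map<String, XObj> resources) {
        this.subtype = subtype;
        this.content = content;
        this.resources = resources;
    }
}

final class ImageStripper {
    // Returns a copy of tokens without image xobject invocations and inline
    // images; form xobjects are kept, and their content is cleaned recursively.
    static List<Object> strip(List<Object> tokens, Map<String, XObj> resources) {
        List<Object> out = new ArrayList<>();
        boolean inInlineImage = false;
        for (Object token : tokens) {
            String op = token instanceof OpToken ? ((OpToken) token).name : null;
            if (inInlineImage) {
                // Drop everything up to and including EI.
                if ("EI".equals(op)) {
                    inInlineImage = false;
                }
                continue;
            }
            if ("BI".equals(op)) {
                inInlineImage = true;
                continue;
            }
            if ("Do".equals(op) && !out.isEmpty()) {
                XObj x = resources.get(out.get(out.size() - 1));
                if (x != null && "Image".equals(x.subtype)) {
                    out.remove(out.size() - 1); // drop the name operand
                    continue;                   // and the Do operator
                }
                if (x != null && "Form".equals(x.subtype)) {
                    // Keep the invocation, but clean the form's own stream.
                    x.content = strip(x.content, x.resources);
                }
            }
            out.add(token);
        }
        return out;
    }
}
```

This sketch does not guard against form xobjects that refer to themselves; valid PDFs should not contain such cycles, but a robust tool would track the forms it has already visited.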
