Traverse whole PDF and change some attribute with some object in it using iText

Problem description

I'm working on a filter program that turns each black text block in a PDF file into a gray one. I have gone through com.itextpdf.text.pdf.parser and can't find anything suitable for this.

PS: I'm using iTextSharp 5.5.10, for which I can't find appropriate documentation. The documentation for iText 5 seems to apply most of the time, but there are still differences. Is there any documentation for iTextSharp itself?

Recommended answer

The OP clarified his question in a comment:


"I'm wondering how to write a parser like PdfTextExtractor or something similar. I was expecting something like a BaseParser but found nothing, so I lost my way."

If you are looking for something like an editing framework, you can use the PdfContentStreamEditor presented in this answer.

Based on the PdfContentStreamEditor (https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/content/PdfContentStreamEditor.java) you can edit the content streams of the PDF pages like this:

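import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.CMYKColor;
import com.itextpdf.text.pdf.GrayColor;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;

import mkl.testarea.itext5.content.PdfContentStreamEditor;

// resource: the source PDF (for example an InputStream); result: an OutputStream for the edited PDF
PdfReader pdfReader = new PdfReader(resource);
PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
PdfContentStreamEditor editor = new PdfContentStreamEditor()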
{
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException
    {
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
        {
            if (currentlyReplacedBlack == null)
            {
                BaseColor currentFillColor = gs().getFillColor();
                if (BaseColor.BLACK.equals(currentFillColor))
                {
                    currentlyReplacedBlack = currentFillColor;
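                    // Emit "0 1 0 rg" (green fill) before the text is shown; the operand list ends with the operator itself
                    super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(1), new PdfNumber(0), new PdfLiteral("rg")));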
                }
            }
        }
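        // First non-text-showing instruction after a replacement: restore black in its original color space
        else if (currentlyReplacedBlack != null)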
        {
            if (currentlyReplacedBlack instanceof CMYKColor)
            {
                super.write(processor, new PdfLiteral("k"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfNumber(1), new PdfLiteral("k")));
            }
            else if (currentlyReplacedBlack instanceof GrayColor)
            {
                super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
            }
            else
            {
                super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfLiteral("rg")));
            }
            currentlyReplacedBlack = null;
        }

        super.write(processor, operator, operands);
    }

    BaseColor currentlyReplacedBlack = null;

    final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};

for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    editor.editPage(pdfStamper, i);
}

pdfStamper.close();

(ChangeTextColor.java, test testChangeBlackTextToGreenDocument)

In PdfContentStreamEditor, the method write is called for each instruction in the content stream and writes that instruction back. By overriding this method and forwarding partially different instructions to the superclass write, one can edit the stream.
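The override-and-forward idea can be sketched without iText at all. The following is only a simplified model, not the actual PdfContentStreamEditor: the class names StreamEditSketch, Editor and GrayTextEditor are hypothetical, and instructions are modeled as plain strings instead of operator and operand objects. It inserts a `0.5 g` instruction (50% gray fill) before each text-showing instruction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StreamEditSketch {
    /** Base editor (hypothetical model): walks the instructions and writes each one back unchanged. */
    static class Editor {
        final List<String> out = new ArrayList<>();

        void edit(List<String> instructions) {
            for (String instruction : instructions) {
                write(instruction);
            }
        }

        // Hook that subclasses override to change what ends up in the output.
        protected void write(String instruction) {
            out.add(instruction);
        }
    }

    /** Subclass: forwards an extra color instruction before each text-showing one. */
    static class GrayTextEditor extends Editor {
        @Override
        protected void write(String instruction) {
            if (instruction.endsWith(" Tj") || instruction.endsWith(" TJ")) {
                super.write("0.5 g"); // set a 50% gray fill color first
            }
            super.write(instruction); // then forward the original instruction
        }
    }

    /** Runs the gray-text editor over the given instructions and returns the edited list. */
    public static List<String> run(List<String> instructions) {
        GrayTextEditor editor = new GrayTextEditor();
        editor.edit(instructions);
        return editor.out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("BT", "/F1 12 Tf", "(Hello) Tj", "ET")));
    }
}
```

Unlike the real editor, this model neither tracks the graphics state nor restores the previous color afterwards; it only shows where the extra instruction is injected.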

The PdfContentStreamEditor subclass above shows how to change the color of text of a given color. In this case, black text is changed to green.

Beware, this is merely a proof-of-concept, not a final and complete solution. In particular:


  • Text is considered to be black if the expression BaseColor.BLACK.equals(color) is true for its color; as equality among BaseColor and its descendant classes is not completely well-defined, this might lead to some false positives.
  • PdfContentStreamEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form xobjects or patterns; thus, some text may not be found.
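One possible way to avoid the false positives mentioned in the first point is to decide "black" from the color-setting operator and its numeric operands rather than from BaseColor equality. The sketch below is iText-independent and rests on assumptions: the class BlackCheckSketch and the method isBlackFill are hypothetical names, only the device color operators g, rg and k are handled, and for CMYK only exactly 0 0 0 1 counts as black (a "rich black" mix would not be detected).

```java
public class BlackCheckSketch {
    private static final double EPS = 1e-6;

    /**
     * Returns true if the given fill color operator with the given numeric
     * operands selects pure black. Anything not explicitly recognized is
     * treated as "not black" to avoid false positives.
     */
    public static boolean isBlackFill(String operator, double... c) {
        switch (operator) {
            case "g":  // DeviceGray: 0 is black
                return c.length == 1 && c[0] < EPS;
            case "rg": // DeviceRGB: 0 0 0 is black
                return c.length == 3 && c[0] < EPS && c[1] < EPS && c[2] < EPS;
            case "k":  // DeviceCMYK: only 0 0 0 1 is accepted here (assumption)
                return c.length == 4 && c[0] < EPS && c[1] < EPS && c[2] < EPS
                        && c[3] > 1 - EPS;
            default:   // sc, scn and others (patterns, spot colors): not decided by this sketch
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isBlackFill("rg", 0, 0, 0));
        System.out.println(isBlackFill("k", 1, 1, 1, 0));
    }
}
```

In an editor like the one above, such a check would have to be fed from the color operators seen in the stream, which means remembering the last fill color instruction instead of asking the graphics state for a BaseColor.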

Improving the class to properly detect black color and to recursively traverse and edit the content streams of used patterns and xobjects remains as an exercise for the reader.
