Replace string in PDF file using iText, but the letter X is not replaced


Problem description

I'm trying to replace a piece of text in the content of a PDF, but the letter 'X' is not being replaced.

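import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.FileOutputStream;
import java.io.IOException;

// Class name chosen here only to make the snippet compilable.
public class ReplaceText {

    public static void main(String[] args) {

        String DEST = "/home/diego/Documentos/teste.pdf";

        try {
            PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
            // Page dictionary of page 1 and its content entry.
            PdfDictionary dictionary = reader.getPageN(1);
            PdfObject object = dictionary.getDirectObject(PdfName.CONTENTS);
            // Only handles a single content stream, not an array of streams.
            if (object instanceof PRStream) {
                PRStream stream = (PRStream)object;
                byte[] data = PdfReader.getStreamBytes(stream);
                // Decodes the whole stream with the platform charset and replaces the text.
                stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
            }
            // Write the manipulated document to DEST.
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
            stamper.close();
            reader.close();
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }

    }
}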

Recommended answer

In general

Basically, the OP's approach cannot work in general. His code is built upon two major misunderstandings:


  • He assumes that one can translate a complete content stream from byte[] to String using a single character encoding, with all string operands of text-showing operators coming out legible.

This assumption is wrong: each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in the string operands of different text-showing operators may represent completely different characters. Actually, fonts do not even need to contain a mapping to characters; they merely need to map numeric values to glyph painting instructions.

Cf. section 9.4.3, "Text-Showing Operators", in ISO 32000-1:


A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font's encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,

Simple PDF generators often merely use standard encodings (which are ASCII-like and may give rise to assumptions like the OP's), but there are more and more non-simple PDF generators out there...
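To make the first point concrete, here is a minimal sketch in plain Java (no iText involved). The two code-to-character tables, FONT_F1 and FONT_F2, are invented for illustration; in a real PDF the mapping comes from each font's encoding or CMap, if a mapping to characters exists at all.

```java
import java.util.HashMap;
import java.util.Map;

public class FontEncodingSketch {

    // Hypothetical encoding of a first font: codes happen to follow ASCII.
    static final Map<Integer, Character> FONT_F1 = new HashMap<>();
    // Hypothetical encoding of a second font: same codes, different characters.
    static final Map<Integer, Character> FONT_F2 = new HashMap<>();

    static {
        FONT_F1.put(0x41, 'A');
        FONT_F1.put(0x42, 'B');
        FONT_F1.put(0x43, 'C');
        FONT_F2.put(0x41, 'N');
        FONT_F2.put(0x42, 'o');
        FONT_F2.put(0x43, 'm');
    }

    // Interprets each byte as one character code (simple font) and looks it up
    // in the given encoding; codes without a mapping are shown as '?'.
    static String decode(byte[] operand, Map<Integer, Character> encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            Character c = encoding.get(b & 0xFF);
            sb.append(c != null ? c : '?');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {0x41, 0x42, 0x43};
        System.out.println(decode(operand, FONT_F1)); // prints ABC
        System.out.println(decode(operand, FONT_F2)); // prints Nom
    }
}
```

The same three bytes yield different text depending on the selected font, which is why decoding the whole content stream with one charset and searching it for a string is unreliable.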

  • He assumes that he can simply edit the string operands of text-showing operators and that the matching glyphs will be shown in the PDF viewer.

This assumption is wrong: fonts usually support only a fairly limited character set, and a text-showing operator uses only a single font, the currently selected one. If one replaces a code in a string operand of such an operator with a different code for which the font has no matching glyph, one will at most see a gap!

While complete fonts usually contain glyphs for at least all characters of a kind (e.g. Latin letters with all their Western European variations), PDF allows embedding fonts partially, cf. section 9.6.4, "Font Subsets", in ISO 32000-1:


PDF documents may include subsets of Type 1 and TrueType fonts.

Nowadays this option is often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on an embedded font containing all characters of a kind just because it contains some of them. There may be a glyph for A and C but not for B.

Unfortunately, the OP has not supplied his sample PDF. The symptoms, though:


  • his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference, as can be seen in his screenshot


and his comment to Viacheslav Vedenin's answer


Before, the text was (Nome Completo)Tj; after, it is (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj


  • but some codes do not show as the expected glyphs, as can also be seen in that screenshot

point in the direction that the latter of the two major false assumptions described above is what makes the OP's code fail: most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.
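The diagnosis can be mimicked in a few lines of plain Java. This is only a sketch under a loud assumption: the set returned by hypotheticalSubset() (all capital letters except K, W, X, and Y, plus the hyphen) is made up to match the symptoms; it was not read from the OP's PDF, which is not available.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SubsetGlyphCheck {

    // Hypothetical font subset: capital letters except K, W, X, Y, plus '-'.
    // This models the suspected situation; it is not taken from the real file.
    static Set<Character> hypotheticalSubset() {
        Set<Character> glyphs = new LinkedHashSet<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            if ("KWXY".indexOf(c) < 0) {
                glyphs.add(c);
            }
        }
        glyphs.add('-');
        return glyphs;
    }

    // Returns, in order of first appearance, the characters of the text
    // for which the subset has no glyph.
    static Set<Character> missingGlyphs(Set<Character> embedded, String text) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!embedded.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String replacement = "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z";
        System.out.println(missingGlyphs(hypotheticalSubset(), replacement)); // prints [K, W, X, Y]
    }
}
```

Characters reported as missing are the ones a viewer could at best render as gaps, no matter what bytes are written into the content stream.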

Instead of blindly editing the content stream, the OP (who is already using iText) can use the following iText concepts:


  • the text extraction classes can also be used to extract the coordinates of text (cf. multiple answers on Stack Overflow), in particular the bounding rectangle of the text he wants to replace;
  • the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
  • PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.
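The last two steps could look roughly like the following sketch, assuming iText 5 with the itext-xtra artifact on the classpath. The class name ReplaceRegionSketch, the method replaceRegion, the output file name, the rectangle coordinates, and the replacement text are all made up for illustration; in practice the rectangle would come from the text extraction step, which is omitted here.

```java
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpLocation;
import com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReplaceRegionSketch {

    // Removes everything inside 'box' on the given page, then draws 'newText'
    // at the lower-left corner of that box.
    static void replaceRegion(String src, String dest, int page, Rectangle box, String newText)
            throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));

        // Erase the existing content in the bounding rectangle.
        List<PdfCleanUpLocation> locations = new ArrayList<>();
        locations.add(new PdfCleanUpLocation(page, box));
        new PdfCleanUpProcessor(locations, stamper).cleanUp();

        // Add the new text on top of the page, using a font supplied by this
        // code (Phrase's default) rather than the document's subset font.
        PdfContentByte canvas = stamper.getOverContent(page);
        ColumnText.showTextAligned(canvas, Element.ALIGN_LEFT, new Phrase(newText),
                box.getLeft(), box.getBottom(), 0);

        stamper.close();
        reader.close();
    }

    public static void main(String[] args) throws IOException, DocumentException {
        // Hypothetical coordinates; determine the real ones via text extraction.
        Rectangle box = new Rectangle(100, 700, 300, 715);
        replaceRegion("termoAdesaoCartao.pdf", "teste.pdf", 1, box, "Maria da Silva");
    }
}
```

Because the new text is drawn with a font chosen by the adding code, it does not depend on which glyphs the original embedded font happens to contain.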

This may sound complicated, but it also takes care of a number of additional minor misconceptions visible in the OP's approach.
