Apache PDFBox: problems with encoding


Question

I have a PDF template and am trying to replace some words in it. I use this code:

private PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (searchString.isEmpty() || replacement.isEmpty()) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                //Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    if (searchString.equals(string)) {
                        System.out.println(string);
                    }
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (searchString.equals(string)) {
                                System.out.println(string);
                            }
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}

My PDF template has only 3 strings: "file:///C/Users/Mi/Downloads/converted.txt", "[10.03.2020 18:43:57]" and "hello!!!". The first two strings are found correctly, but the third comes out as "KHOOR...".

As I understand it, there is an encoding mismatch. When I try to replace "file:///C/Users/Mi/Downloads/converted.txt" with "Hello!", the result is displayed as "ello": the uppercase letter and the punctuation mark are not shown. As I understand it, the key difference is in the fonts: "hello" has font settings, the other strings do not.

The source PDF is here: https://yadi.sk/i/l0OAcFkAkUHKYg

Please advise how to get the text from the PDF as correct strings and how to replace it.

Recommended answer

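This answer is essentially an explanation of why a generic solution for your task is very complicated at the least, if not impossible. Under benign circumstances, i.e. for PDFs subject to specific restrictions, code like yours can be used successfully, but your example PDF shows that the PDFs you apparently want to manipulate are not restricted in that way.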

A number of factors impede automatic replacement of text in PDFs: some already make it difficult to find the instructions that draw the text in question, and some complicate replacing the characters in the arguments of those instructions.

The list of problems given here is not exhaustive!

PDFs contain content streams, which contain sequences of instructions telling a PDF processor what to draw where. Regular text in a PDF is drawn by instructions that set the current font (and font size), set the position to draw the text at, and actually draw the text. This can be as easy to understand and search for as this:

/TT0 1 Tf
9 0 0 9 5 5 Tm
(file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]) Tj 

(Here the font TT0 is selected with size 1, then an affine transformation is applied that scales text by a factor of 9 and moves it to the position (5, 5), and finally the text "file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]" is drawn.)

In such a case, finding the instructions responsible for drawing a given piece of text is easy. But the instructions in question may also look different.

For example, the string may be drawn in pieces. Instead of the Tj instruction above, we may have

[(file:///C/Users/Mi/Downloads/converted.txt)2 ([10.03.2020 18:43:57])] TJ

(Here "file:///C/Users/Mi/Downloads/converted.txt" is drawn first, then the text drawing position is moved slightly, then "[10.03.2020 18:43:57]" is drawn, all in the same TJ instruction.)

Or you might see

(file:///C/Users/Mi/Downloads/converted.txt) Tj
([10.03.2020 18:43:57]) Tj 

(Here the text parts are drawn in separate instructions.)
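
Both variants mean that a search has to look at the concatenation of several string operands rather than at each operand on its own. The following is a minimal sketch of that idea in plain Java, not PDFBox code: the `TjArrayJoin` class and its `joinStrings` method are hypothetical, the TJ operand array is modelled as a simple list of strings and numbers, and the strings are assumed to be already decoded.

```java
import java.util.List;

public class TjArrayJoin {
    // Concatenates the string elements of a TJ operand array, skipping the
    // numeric position adjustments that sit between them.
    static String joinStrings(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simplified model of the TJ example above: string, adjustment, string.
        List<Object> tjArray = List.of(
                "file:///C/Users/Mi/Downloads/converted.txt", 2, "[10.03.2020 18:43:57]");
        String joined = joinStrings(tjArray);
        // A search term spanning both pieces only matches the joined text.
        System.out.println(joined.contains("converted.txt[10.03.2020"));
    }
}
```

Comparing the search string against each array element separately would miss any term that crosses an element boundary.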

Also, the order of the text pieces may be unexpected:

([10.03.2020 18:43:57]) Tj 
-40 0 Td
(file:///C/Users/Mi/Downloads/converted.txt) Tj

(First the date string is drawn, then the text position is moved left quite a bit, to before the drawn date, then the URL is drawn.)

Some PDF producers draw each character separately, setting the whole text transformation in between:

9 0 0 9 5 5 Tm
(f) Tj
9 0 0 9 14 5 Tm
(i) Tj
9 0 0 9 23 5 Tm
(l) Tj
...

And these instructions need not be arranged in sequence as they are here. They can be spread over the whole stream, or even over multiple streams, because a page can have an array of content streams instead of a single one, and part of the string may be drawn in the content stream of a sub-object referenced from the page content stream.

Thus, to find the instructions responsible for a specific multi-character text, you may have to inspect multiple streams and glue the strings you find together according to the positions they were drawn at.
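
The gluing step can be illustrated with a small plain-Java sketch. It assumes the fragments have already been collected together with their drawing positions; the `Fragment` type, the `glue` method, and the coordinates are made up for illustration and are not taken from the example PDF.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FragmentGlue {
    // A piece of drawn text with the position it was drawn at (hypothetical type).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Orders fragments top to bottom (larger y first, assuming the y axis
    // points up), then left to right, and concatenates their text.
    static String glue(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Drawing order as in the example where the date is drawn first.
        List<Fragment> drawn = List.of(
                new Fragment(45, 5, "[10.03.2020 18:43:57]"),
                new Fragment(5, 5, "file:///C/Users/Mi/Downloads/converted.txt"));
        // prints file:///C/Users/Mi/Downloads/converted.txt[10.03.2020 18:43:57]
        System.out.println(glue(drawn));
    }
}
```

A real implementation would also need tolerances for slightly differing baselines and a rule for when a gap between fragments counts as a space, which this sketch leaves out.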

Not every single character code necessarily corresponds to a single character of your search string. There are a number of special glyphs for combinations of characters, for example a single ligature glyph for "fl". So for searching, one has to expand such ligatures.
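
If the extracted text represents a ligature as a Unicode ligature character such as U+FB02, one way to expand it is Unicode compatibility normalization. This is a sketch under that assumption only; the `LigatureExpansion` class is hypothetical, and NFKC also rewrites other compatibility characters, which may or may not be wanted.

```java
import java.text.Normalizer;

public class LigatureExpansion {
    // NFKC normalization replaces compatibility characters such as U+FB02
    // (LATIN SMALL LIGATURE FL) with their plain-letter equivalents.
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "\uFB02oor"; // ligature character followed by "oor"
        // The plain search term only matches after expansion.
        System.out.println(extracted.contains("floor"));         // prints false
        System.out.println(expand(extracted).contains("floor")); // prints true
    }
}
```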

In the examples above, the characters of the text were easy to recognize even when the text was not drawn in a single run. But in PDFs the encoding of the characters need not be so obvious; each font may come with its own encoding, e.g.

<004B0048004F004F0052000400040004>Tj 

may draw "hello!!!".

(Here the string argument is written as a hex string; in the debugger you saw "KHOOR...".)

Thus, to search for text, one first needs to map the string arguments of the text drawing instructions to Unicode according to the specific encoding of the current font.
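
To make the "KHOOR" effect concrete, here is a small plain-Java sketch that splits the hex string from the example into 2-byte codes and decodes them twice: once naively, treating each code as a Unicode value, and once through a font-specific table. The table in `exampleTable` is an assumption reconstructed from the example (the codes drawing "hello!!!"), not data read from the PDF, and the `TwoByteCodes` class is hypothetical.

```java
import java.util.Map;

public class TwoByteCodes {
    // Splits a hex string into 2-byte character codes.
    static int[] codes(String hex) {
        int[] result = new int[hex.length() / 4];
        for (int i = 0; i < result.length; i++) {
            result[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return result;
    }

    // Reads each code as if it were a Unicode value, ignoring the font.
    static String naive(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    // Maps each code through a font-specific table; unknown codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            char c = toUnicode.getOrDefault(code, '\uFFFD');
            sb.append(c);
        }
        return sb.toString();
    }

    // Assumed code-to-Unicode table for the font in the example.
    static Map<Integer, Character> exampleTable() {
        return Map.of(0x004B, 'h', 0x0048, 'e', 0x004F, 'l', 0x0052, 'o', 0x0004, '!');
    }

    public static void main(String[] args) {
        int[] codes = codes("004B0048004F004F0052000400040004");
        // Starts with "KHOOR", followed by three control characters.
        System.out.println(naive(codes));
        // prints hello!!!
        System.out.println(decode(codes, exampleTable()));
    }
}
```

The same table would be needed in the other direction to write a replacement string back, which is why writing the raw bytes of a Java string into the operand does not produce the intended glyphs for such a font.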

But the PDF need not contain a mapping from the individual codes to Unicode characters; there may only be a mapping to glyph IDs in the font file. In the case of embedded font files, these font files need not contain any mapping to Unicode characters either.

Often PDF files do carry information on the Unicode characters matching the codes, to allow text extraction, e.g. for copy/paste. Strictly speaking, though, such information is optional. Even worse, that information may contain errors without causing any problems when the PDF is displayed. In all such situations one has to use OCR-like mechanisms to recognize the Unicode character associated with each glyph.

Once you have found the instructions responsible for drawing the text you searched for, you have to replace the text. This may also involve some problems.

If font files are embedded in a PDF, they are often embedded merely as subsets of the original fonts to save space. For example, in your example PDF the font Tahoma, used to display "hello!!!", is embedded with only a subset of its glyphs.

Even Times New Roman (the font used for the text you could recognize) is embedded only as a subset.

Thus, even if you found the "hello!!!" in Tahoma, simply replacing the character codes to mean "byebye??" would only display " e e ", because 'e' is the only one of those characters for which a glyph is present in the embedded font.
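
A check of this kind can be sketched in plain Java as follows. The subset used here is an assumption: it contains only the characters that "hello!!!" itself needs, since the actual glyph listing of the embedded font is not reproduced here. The `SubsetCheck` class and its `missing` method are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SubsetCheck {
    // Returns the characters of the replacement text that have no glyph
    // in the embedded subset, in order of first occurrence.
    static Set<Character> missing(String replacement, Set<Character> subset) {
        Set<Character> result = new LinkedHashSet<>();
        for (char c : replacement.toCharArray()) {
            if (!subset.contains(c)) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed subset: just the characters needed to draw "hello!!!".
        Set<Character> subset = Set.of('h', 'e', 'l', 'o', '!');
        // prints [b, y, ?]
        System.out.println(missing("byebye??", subset));
    }
}
```

Running such a check before rewriting any operand makes it possible to refuse a replacement, or to switch to another font, instead of silently producing blanks.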

Thus, to replace text you may have to either edit the embedded font file and the PDF font object representing it so that they contain and encode all required glyphs, or add another font, together with instructions that switch to that font for the manipulated text drawing instructions and back again afterwards.

Even if your font is not embedded at all (so the complete local copy of the font will be used), or is embedded with all the glyphs you need, the encoding used for the font may be limited. In PDFs based on Western European languages you will often find WinAnsiEncoding, an encoding similar to Windows code page 1252. If you want to replace text with Cyrillic text, there are no character codes for those characters.
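
The limitation can be demonstrated with the JDK's own charset support. This sketch uses the `windows-1252` charset as a stand-in for WinAnsiEncoding, which is only similar to it, so the result is an approximation; the `EncodingCheck` class is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCheck {
    // Reports whether every character of the text has a code in the given charset.
    static boolean encodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Western European text fits into code page 1252.
        System.out.println(encodable("Hello!", "windows-1252"));
        // Cyrillic text (written with Unicode escapes) does not.
        System.out.println(encodable("\u041F\u0440\u0438\u0432\u0435\u0442", "windows-1252"));
    }
}
```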

Thus, in this case you might have to change the encoding to include all the characters you need (finding unused character codes in the present encoding by scanning all uses of the font in question), or add another font with a more appropriate encoding.

If your replacement text is longer or shorter than the replaced text and other text follows on the same line in the PDF, you have to decide whether that text should be moved, too. It may belong to the replaced text and have to be shifted accordingly, but it may instead belong to a separate text block or column, in which case it should not be moved.

Text justification may also be damaged.

Also consider marked text (underline / strikethrough / background color / ...). These markings in a PDF are (usually) not font properties but separate vector graphics. To get them right, you have to parse the vector graphics and annotations from the page, heuristically identify text markings, and update them.

If you deal with tagged PDFs (e.g. for accessibility), this may make finding text easier (as accessibility should allow for easy text extraction) but replacing text harder, because you may also have to update some tags or structure tree data.

As shown above, there are many obstacles to text replacement in PDFs. Thus, a complete solution (where possible at all) is far beyond the scope of a Stack Overflow answer. Some pointers, though:

To find the text to replace, you should make use of the PDFTextStripper (a PDFBox utility class for text extraction) and extend it so that you get all the text together with pointers to the text drawing instruction that draws each respective character. This way you don't have to implement all the decoding and sorting of the text yourself.

To replace the text, you can ask the PDFBox font classes (provided by the PDFTextStripper if extended accordingly) whether they can encode your replacement text.

And always have a copy of the PDF specification (ISO 32000-1 or ISO 32000-2) at hand...

But do be aware that it will take you a while, a number of weeks or months, to arrive at a reasonably decent generic solution.
