Replace or remove text from PDF with PDFBox in Java


Question

I'm trying to use PDFBox 2.0 to replace a text pattern with an empty string, or delete it outright (in my case I want to remove all "[QR]" words from all PDFs), but I can't find anything that works for me.

I tried iText as well, with the same result: nothing works.

The "[QR]" strings in my PDF were added by editing the PDF after it was created; maybe that's why they don't appear as Tj operators?

My main:

replaceText(documentoPDF, "[QR]", "");

My method (I printed the Tj values and my pattern doesn't appear there):

public void replaceText(PDDocument documentoPDF, String searchString, String replacement) throws IOException{

    for ( PDPage page : documentoPDF.getPages()){
        
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<?> tokens = parser.getTokens();
        
        for (int j = 0; j < tokens.size(); j++){
            
            Object next = tokens.get(j);
            if (next instanceof Operator){
                Operator op = (Operator) next;
                
                String pstring = "";
                int prej = 0;
                
                //Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) 
                {
                    // Tj takes one operand, the string to display, so let's update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else 
                if (op.getName().equals("TJ")) 
                {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) 
                    {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) 
                        {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }                       
                    }                        
                    
                    System.out.println(pstring.trim());
                    
                    if (searchString.equals(pstring.trim())) 
                    {                            
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());                           

                        int total = previous.size()-1;    
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }                            
                    }
                }
            }
        }
        
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(documentoPDF);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);            
        out.close();
        page.setContents(updatedStream);
    }

    documentoPDF.save("resources\\resultado\\nuevo.pdf");
}

This is an example of pdf with some [QR] patterns: http://www.mediafire.com/file/9w3kkc4yozwsfms/file

If someone can help, I'd appreciate it.

I can upload my entire project if you need it.

Thanks in advance.

Solution

As already mentioned in comments, the reason your code doesn't work is simple: you completely ignore the encoding of the font used for that text. The content stream actually contains [( >) ( 4) ( 5) ( @) ] TJ instructions (the "spaces" before '>', '4', '5', and '@' are actually zero bytes, 0x00). So the encoding is apparently some 16-bit encoding in which, additionally, ASCII is not naturally embedded.
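To make the encoding point concrete, here is a minimal sketch in plain Java (no PDFBox involved). It takes the eight bytes quoted above, first reads them as single-byte text, which is roughly what comparing the raw string operand against "[QR]" amounts to, and then decodes them as two-byte character codes. The class name, the method names and the four-entry code table are assumptions made for illustration; in a real document the mapping comes from the font itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class TwoByteCodeDemo {
    // Hypothetical code-to-Unicode table, chosen to match the bytes quoted in the answer.
    // A real font supplies this mapping itself; it is hard-coded here only for the demo.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(0x003E, "[");
        TO_UNICODE.put(0x0034, "Q");
        TO_UNICODE.put(0x0035, "R");
        TO_UNICODE.put(0x0040, "]");
    }

    // The operand bytes of the TJ instruction: 0x00 '>' 0x00 '4' 0x00 '5' 0x00 '@'.
    static final byte[] QR_BYTES = {0x00, 0x3E, 0x00, 0x34, 0x00, 0x35, 0x00, 0x40};

    // Naive reading: treat every byte as one character, ignoring the font encoding.
    static String naive(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Font-aware reading: combine two bytes into one character code, then map it to Unicode.
    static String decode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            String unicode = TO_UNICODE.get(code);
            if (unicode != null) {
                sb.append(unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(naive(QR_BYTES).contains("[QR]")); // false
        System.out.println(decode(QR_BYTES));                  // [QR]
    }
}
```

The naive reading never contains "[QR]", so a search on the raw operand cannot match, while the decoded reading does.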

To properly take the font into account one has to keep track of the current font. This means parsing the whole content stream and analyzing text font setting calls, save graphics state calls, and restore graphics state calls. Then you have to retrieve the proper font object from the correct resources.
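The bookkeeping described here can be sketched on a toy token list, again without PDFBox. The sketch below assumes a heavily simplified content stream in which every token is a plain string. It tracks the font selected by Tf, saves it on q, restores it on Q, and reports which font is active at each text-showing operator. `FontTrackingSketch` and `fontsAtTextShowing` are hypothetical names, and looking up the actual font object in the page resources is not modelled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class FontTrackingSketch {
    // Returns the name of the active font at each text-showing operator, in stream order.
    // Simplification: a real graphics state holds far more than the font name.
    static List<String> fontsAtTextShowing(List<String> tokens) {
        Deque<String> savedFonts = new ArrayDeque<>();
        String currentFont = "(none)"; // sentinel: no font selected yet
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            switch (tokens.get(i)) {
                case "q": // save graphics state
                    savedFonts.push(currentFont);
                    break;
                case "Q": // restore graphics state
                    if (!savedFonts.isEmpty()) {
                        currentFont = savedFonts.pop();
                    }
                    break;
                case "Tf": // operands: font name, font size
                    if (i >= 2) {
                        currentFont = tokens.get(i - 2);
                    }
                    break;
                case "Tj":
                case "TJ":
                case "'":
                case "\"":
                    result.add(currentFont);
                    break;
                default:
                    break; // operands and all other operators are ignored in this sketch
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "/F1", "12", "Tf", "(Hello)", "Tj",
                "q", "/F2", "9", "Tf", "(x)", "Tj", "Q",
                "(World)", "Tj", "Q");
        System.out.println(fontsAtTextShowing(tokens)); // [/F1, /F2, /F1]
    }
}
```

The third string is shown in /F1 again because Q restores the state saved by the inner q; a parser that only remembered the most recent Tf would decode it with the wrong font.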

All this actually is already done by the PDFBox content parsing framework used for e.g. text extraction. Thus, we can create a content stream editor around this framework.

Actually, this also has already been done, see the PdfContentStreamEditor from this answer.

In your document, each text piece to delete is drawn by a single text drawing instruction, and each of those instructions draws nothing but the text to remove. So we can simply look at the text the current instruction draws and then decide whether to keep the instruction or not:

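import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;
// PdfContentStreamEditor is the helper class from the answer referenced above, not part of PDFBox.

PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        // Text drawn by the glyphs seen since the last instruction was written
        final StringBuilder recentChars = new StringBuilder();

        // Called once per glyph; the font is already resolved, so the code can be mapped to Unicode.
        // Note: the showGlyph parameter list may differ between PDFBox major versions;
        // match the signature of the version you compile against.
        @Override
        protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
                throws IOException {
            String string = font.toUnicode(code);
            if (string != null)
                recentChars.append(string);

            super.showGlyph(textRenderingMatrix, font, code, displacement);
        }

        // Called once per instruction; decides whether it is copied to the new content stream
        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String recentText = recentChars.toString();
            recentChars.setLength(0);
            String operatorString = operator.getName();

            // Drop text showing instructions that draw exactly "[QR]"
            if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "[QR]".equals(recentText))
            {
                return;
            }

            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    editor.processPage(page);
}
document.save("nuevo-noQrText.pdf");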

(EditPageContent test testRemoveQrTextNuevo)
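The keep-or-drop decision made in write can also be modelled without PDFBox, which makes its main limitation easier to see. In the sketch below, `Instruction` is a hypothetical stand-in for one content stream instruction together with the text its glyphs decode to; `DropInstructionSketch`, `sample` and `keptOperators` are likewise names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DropInstructionSketch {
    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

    // Hypothetical stand-in: an operator name plus the already decoded text it draws.
    static final class Instruction {
        final String operator;
        final String decodedText; // empty for operators that draw no text

        Instruction(String operator, String decodedText) {
            this.operator = operator;
            this.decodedText = decodedText;
        }
    }

    // Made-up sample stream for the demo.
    static List<Instruction> sample() {
        return Arrays.asList(
                new Instruction("BT", ""),
                new Instruction("Tf", ""),
                new Instruction("TJ", "Invoice 42"),
                new Instruction("TJ", "[QR]"),
                new Instruction("Tj", "see [QR] code"),
                new Instruction("ET", ""));
    }

    // Keeps every instruction except text-showing ones whose text equals the unwanted string.
    static List<String> keptOperators(List<Instruction> stream, String unwanted) {
        StringBuilder recentChars = new StringBuilder();
        List<String> kept = new ArrayList<>();
        for (Instruction instruction : stream) {
            recentChars.append(instruction.decodedText); // the role showGlyph plays in the editor
            String recentText = recentChars.toString();
            recentChars.setLength(0);                    // reset after every instruction
            if (TEXT_SHOWING_OPERATORS.contains(instruction.operator) && unwanted.equals(recentText)) {
                continue; // drop the whole instruction
            }
            kept.add(recentText.isEmpty() ? instruction.operator : instruction.operator + " " + recentText);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keptOperators(sample(), "[QR]"));
    }
}
```

Only the instruction that draws exactly "[QR]" is dropped; the one drawing "see [QR] code" survives unchanged. The same holds for the editor above: it fits documents like this one, where the unwanted text always sits in an instruction of its own, and would need operand rewriting to handle text embedded in longer strings.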

In the result, the "[QR]" texts underneath the QR codes have vanished.
