How to open and replace data in a PDF stream with the Apache PDFBox library in Java?


Question

I use Apache PDFBox 2.0.0 in my Java code (Java 1.6). I'm trying to figure out how I can read, replace, and save back to my PDF the data from

<stream> data here... <endstream> ?

My PDF file looks like this:

596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ­[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н"іXPJГ ...
endstream
endobj

I found a way to decode this stream: I used the "WriteDecodedDoc" command from pdfbox-app-1.8.10.jar. So now I have two variants of the file, but I have no idea how to work with this stream. The stream contains the footer and header, where images and text are placed.
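For context, /FlateDecode data is zlib-compressed, so the kind of decoding that WriteDecodedDoc performs can be illustrated with the JDK alone. The following is only a minimal sketch, not PDFBox's implementation. The names FlateSketch, inflate, and deflate are made up for illustration, and the sketch assumes that the raw bytes between stream and endstream have already been extracted and that no predictor parameters are involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FlateSketch {
    // Inflate zlib-wrapped deflate data, which is what /FlateDecode stores.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compress bytes the same way, as would be needed before storing them again.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] content = "BT /T1_1 1 Tf (www.example.com) Tj ET".getBytes(latin1);
        byte[] packed = deflate(content);
        // Round trip: the decoded bytes are the readable content stream again.
        System.out.println(new String(inflate(packed), latin1));
    }
}
```

With default settings the compressed data starts with the bytes 0x78 0x9C, which is probably what shows up as the first two characters of the garbled stream in the object dump above.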

I checked my file with the PDFTextStripper class. It can see the data I need in the streams, but I can't use this class to replace the data and save it back to the PDF file.

I tried to replace this text by simply opening the file as text, searching for the text, replacing it only inside the stream, and saving. But then I get the problem "Cannot extract the embedded font...". The main reason is that I lose the encoding. I tried changing the encoding, but it didn't help.
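One likely reason the text-editor approach fails is that a PDF is a binary file: decoding its bytes as text and encoding them again does not necessarily give back the same bytes. The sketch below illustrates this with the JDK only; TextRoundTripSketch and roundTrip are hypothetical names, and the three sample bytes are just an example of binary data.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class TextRoundTripSketch {
    // Decode bytes as text in the given charset and encode the text again.
    static byte[] roundTrip(byte[] data, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Two bytes that often start a zlib stream, plus one more binary byte.
        byte[] binary = {0x78, (byte) 0x9C, (byte) 0xAD};
        // UTF-8 replaces the invalid bytes, so the data is changed: prints false
        System.out.println(Arrays.equals(binary, roundTrip(binary, "UTF-8")));
        // ISO-8859-1 maps every byte to one character and back: prints true
        System.out.println(Arrays.equals(binary, roundTrip(binary, "ISO-8859-1")));
    }
}
```

Even with a byte-preserving charset, changing the length of a stream by hand leaves /Length and the cross-reference offsets stale, which is one more reason to make the change through a PDF library.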

By the way, I can't use iText. I have to use free libraries here.

Thanks for any solution.

After decoding, the stream looks like this:

stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
endstream

I need to replace a link inside the stream with a different link. This one:

[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
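The operand of TJ is an array that mixes string pieces with numeric position adjustments, so the visible text only appears once the string pieces are joined. As a rough illustration in plain Java (not PDFBox code), the sketch below joins the literal strings of such an array given as text. TjTextSketch and joinStrings are hypothetical names, and the parser is deliberately simplified: it keeps the character after a backslash as-is, so octal escapes such as \251 and control escapes are not decoded, and hex strings in angle brackets are ignored.

```java
final class TjTextSketch {
    // Concatenate the literal strings of a TJ operand such as "[(ab)12(cd)]",
    // skipping the numeric adjustments between them.
    static String joinStrings(String tjArray) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // nesting level of unescaped parentheses
        for (int i = 0; i < tjArray.length(); i++) {
            char c = tjArray.charAt(i);
            if (depth > 0 && c == '\\' && i + 1 < tjArray.length()) {
                text.append(tjArray.charAt(++i)); // simplified escape handling
            } else if (c == '(') {
                if (depth++ > 0) {
                    text.append(c); // nested parenthesis belongs to the string
                }
            } else if (c == ')' && depth > 0) {
                if (--depth > 0) {
                    text.append(c);
                }
            } else if (depth > 0) {
                text.append(c);
            }
            // Anything at depth 0 (brackets, numbers, spaces) is not text.
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinStrings("[(www)11(.exa)-14.9(mple)15.1(.com)]"));
    }
}
```

A comparison against the joined text is only as reliable as the pieces themselves, so it is worth printing the joined string for the real stream before hard-coding the value to search for.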

Edit 2. Code

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // COSBase cosb = document.getDocument().getObjects().get(27);
            // e.g. this object contains <stream> bytecode <endstream> in the PDF file.
            // it looks that
            // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList size 10 has a data that I can't open and work
            // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all PDF's tag but NO a stream section

            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test2.test2.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.linkhouldbehere.com")) {
                                // some magic here to remove all indents and show new link from beginning.
                                // no rules. Just for test and it works here
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding of date from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    newTokens.add(token);
                }

                // save replaced content inside a page
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();
                page.setContents(newContents);

                // replace all links that have a pop-up line
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }
            // save file
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

Edit 3.

The PDF contains object 660 0, which holds the link I need:

660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC 
BT
/CS0 cs 0.412 0.416 0.423  scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
EMC 
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(\251 2015 )29(CCH Incorporated and its af\037liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC 

endstream

and I found only one place in the PDF file that references it, in object 45 0:

/XObject <<
    /Fm0 660 0 R
    /Fm1 661 0 R
>>

Full text of the object:

45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj

My question: can I get this object 660 0 and process it with PDFBox? It looks like PDFStreamParser doesn't know anything about it. Thank you.

Recommended answer

This is for PDFBox 2.0.0-SNAPSHOT. Here is my code, which works fine for me for replacing links.

Many thanks to Tilman Hausherr for his help.

String filePath = "d:\\pdf\\file1.pdf";

...

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            // Decrypt a document
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // replace all links in a footer and a header in XObjects with /ProcSet [/PDF /Text]
            // Note: these forms (and pattern objects too!) can have resources,
            // i.e. have Form XObjects or patterns again.
            // If so you need to use a recursion
            for (int pageNum = 0; pageNum < document.getPages().getCount(); pageNum++) {
                List<Object> newPdxTokens = new ArrayList<Object>();
                // Get all XObjects from the page
                Iterable<COSName> xobjs = document.getPage(pageNum).getResources().getXObjectNames();
                for (COSName xobj : xobjs) {
                    boolean isHasTextStream = false;
                    PDXObject pdxObject = document.getPage(pageNum).getResources().getXObject(xobj);
                    // If a stream has not '/ProcSet [/PDF /Text]' line inside it has to be skipped
                    // isXobjectHasTextFieldInPdf has a recursion
                    if (pdxObject.getCOSObject() instanceof COSDictionary) {
                        isHasTextStream = isXobjectHasTextFieldInPdf((COSDictionary) pdxObject.getCOSObject());
                    }

                    if (pdxObject instanceof PDFormXObject && isHasTextStream) {
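                        // Start from an empty token list so that tokens of a previously
                        // processed XObject are not written into this one
                        newPdxTokens.clear();
                        // Get the stream of the form XObject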
                        PDStream stream = pdxObject.getStream();
                        PDFStreamParser streamParser = new PDFStreamParser(stream.toByteArray());
                        streamParser.parse();
                        for (Object token : streamParser.getTokens()) {
                            if (token instanceof Operator) {
                                Operator op = (Operator) token;
                                if (op.getName().equals("Tj")) {
                                    // Tj contains 1 COSString
                                    COSString previous = (COSString) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = previous.getString();
                                    // here can be any filters for checking a necessary string
                                    if (string.equals("www.testlink.com")) {
                                        // Tj takes a single string operand, so replace it with a COSString
                                        newPdxTokens.set(newPdxTokens.size() - 1, new COSString("test.test.com"));
                                    }
                                } else if (op.getName().equals("TJ")) {
                                    // TJ contains a COSArray with COSStrings and COSFloat (padding)
                                    COSArray previous = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = "";
                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement instanceof COSString) {
                                            COSString cosString = (COSString) arrElement;
                                            String content = cosString.getString();
                                            string += content;
                                        }
                                    }
                                    // here can be any filters for checking a necessary string
                                    // check if string contains a necessary link
                                    if (string.equals("www.testlink.com")) {
                                        COSArray newLink = new COSArray();
                                        newLink.add(new COSString("test.test.com"));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    } else if (string.startsWith("www.testlink.com")) {
                                        // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                        COSArray newLink = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                        int size = newLink.size();
                                        float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                        for (int i = 0; i < size - 4; i++) {
                                            newLink.remove(0);
                                        }
                                        newLink.set(0, new COSString("test.test.com"));
                                        // number for indenting from right place. Should be checked.
                                        newLink.set(1, new COSFloat(f - 8000));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    }
                                }
                            }
                            // save tokens to a temporary List
                            newPdxTokens.add(token);
                        }
                        // save the replaced data back to the stream
                        OutputStream out = stream.createOutputStream();
                        ContentStreamWriter writer = new ContentStreamWriter(out);
                        writer.writeTokens(newPdxTokens);
                        out.close();
                    }
                }
            }

            // replace data from any text stream from pdf. XObjects not included.
            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                // Get all tokens from the page
                List<Object> tokens = parser.getTokens();
                // Create a temporary List
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                // Tj takes a single string operand, so replace it with a COSString
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.testlink.com")) {
                                // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    // save tokens to a temporary List
                    newTokens.add(token);
                }
                // save the replaced data back to the document's stream
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();

                // save content
                page.setContents(newContents);

                // replace all links that have a pop-up line (It does not affect the visible text)
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }

            // save document
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
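The core pattern in the method above is that a content stream is written in postfix order: operands come first and the operator follows, so when a text-showing operator arrives, its operand is the last token already copied to the new list. The sketch below shows that look-back step in isolation with plain Java objects standing in for the PDFBox tokens. TokenRewriteSketch, Op, and replaceShownText are hypothetical names, and strings stand in for COSString operands.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenRewriteSketch {
    // Stand-in for a content stream operator such as Tj (hypothetical type).
    static final class Op {
        final String name;

        Op(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Copy tokens into a new list. When a Tj operator arrives, its operand is
    // the last token already copied, so that is the element to swap.
    static List<Object> replaceShownText(List<Object> tokens, String oldText, String newText) {
        List<Object> result = new ArrayList<Object>();
        for (Object token : tokens) {
            if (token instanceof Op && "Tj".equals(((Op) token).name) && !result.isEmpty()) {
                int last = result.size() - 1;
                if (oldText.equals(result.get(last))) {
                    result.set(last, newText); // the operand stays a string, as Tj expects
                }
            }
            result.add(token); // the operator itself is always copied unchanged
        }
        return result;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(
                new Op("BT"), "www.testlink.com", new Op("Tj"), new Op("ET"));
        System.out.println(replaceShownText(tokens, "www.testlink.com", "test.test.com"));
    }
}
```

Building a fresh list for every stream keeps the tokens of one stream from leaking into the next one.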

An extra method that processes only text streams and skips image streams. It is called from the main method "replaceLinksInPdf(String filePath)".

        // Check if COSDictionary has '/ProcSet [/PDF /Text]' string in the stream
        private static boolean isXobjectHasTextFieldInPdf(COSDictionary dictionary) {
            boolean isHasTextField = false;
            for (COSBase cosBase : dictionary.getValues()) {
                // go to a recursion because COSDictionary can have COSDictionaries inside
                if (cosBase instanceof COSDictionary) {
                    COSDictionary cosDictionaryNew = (COSDictionary) cosBase;
                    // check if '/ProcSet' has '/Text' param
                    if (cosDictionaryNew.containsKey(COSName.PROC_SET)) {
                        COSBase procSet = cosDictionaryNew.getDictionaryObject(COSName.PROC_SET);
                        if (procSet instanceof COSArray) {
                            for (COSBase procSetIterator : ((COSArray) procSet)) {
                                if (procSetIterator instanceof COSName
                                        && ((COSName) procSetIterator).getName().equals("Text")) {
                                    return true;
                                }
                            }
                        } else if (procSet instanceof COSString && ((COSString) procSet).getString().equals("Text")) {
                            return true;
                        }
                    }
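                    // go to the COSDictionary children; return on the first hit so that
                    // a later sibling dictionary cannot overwrite a positive result
                    if (isXobjectHasTextFieldInPdf(cosDictionaryNew)) {
                        return true;
                    }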
                }
            }
            return isHasTextField;
        }
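The recursive search above can be modelled without PDFBox to make its two delicate points easier to see: returning as soon as a nested dictionary matches, and not following the same dictionary twice. In the sketch below, plain maps and lists stand in for COSDictionary and COSArray. ProcSetSearchSketch and hasTextProcSet are hypothetical names, the cycle guard is an addition that the original method does not have, and unlike the original this version also checks the top-level dictionary itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ProcSetSearchSketch {
    // True if this dictionary or any nested one has a ProcSet list containing "Text".
    static boolean hasTextProcSet(Map<String, Object> dict) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        return search(dict, seen);
    }

    @SuppressWarnings("unchecked")
    private static boolean search(Map<String, Object> dict, Set<Object> seen) {
        if (!seen.add(dict)) {
            return false; // already visited: guards against reference cycles
        }
        Object procSet = dict.get("ProcSet");
        if (procSet instanceof List && ((List<?>) procSet).contains("Text")) {
            return true;
        }
        for (Object value : dict.values()) {
            if (value instanceof Map && search((Map<String, Object>) value, seen)) {
                return true; // stop at the first hit so later siblings cannot reset it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<String, Object>();
        resources.put("ProcSet", Arrays.<Object>asList("PDF", "Text"));
        Map<String, Object> form = new HashMap<String, Object>();
        form.put("Subtype", "Form");
        form.put("Resources", resources);
        System.out.println(hasTextProcSet(form)); // prints true
    }
}
```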

This is just a test version for my project; I will refactor it according to the project's rules. You should change the replacement strings as needed. Also, I have only been using the PDFBox 2.0.0 library for about a week, so someone may well find a simpler way to write this. Feel free to review the code and post a better version. Thanks.
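On the "f - 8000" adjustment in the code above: a number inside a TJ array is expressed in thousandths of a text space unit and is subtracted from the current horizontal position, scaled by the font size and the horizontal scaling. The sketch below works that arithmetic out. TjShiftSketch and shift are hypothetical names, and character and word spacing are left out, which matches the 0 Tc and 0 Tw seen in the stream above.

```java
final class TjShiftSketch {
    // Horizontal shift, in text space units, caused by one number inside a TJ array.
    // The number is in thousandths of a unit and is subtracted from the position;
    // horizontalScale is the Tz value as a fraction (1.0 means 100%).
    static double shift(double tjNumber, double fontSize, double horizontalScale) {
        return -tjNumber / 1000.0 * fontSize * horizontalScale;
    }

    public static void main(String[] args) {
        // With "/T1_1 1 Tf" the font size is 1, and "100 Tz" means a scale of 1.0.
        System.out.println(shift(-8000, 1, 1.0)); // prints 8.0 (moves right)
        System.out.println(shift(11, 1, 1.0));    // a small move to the left
    }
}
```

In the stream shown in the question the text matrix is 8 0 0 8, so 8 text space units would correspond to 64 units before any further transformation. A value that depends on the font and matrix like this has to be recomputed for each document rather than hard-coded.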

P.S. I tested this on 40 PDFs, and only 2 of them needed the deeper, recursive processing. All 40 files can be opened and read, and they look the same as before except for the links.
