PDFBox 2.0.4: XFA to text error
Problem description
I am getting the following errors when trying to convert a PDF (XFA) to a string.
The errors started when I switched from PDFBox 1.8.12 to PDFBox 2.0.4.
Here is the log:
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 779916
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 780049
Mar 09, 2017 7:16:07 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 780074
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 780074
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:951)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:866)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:150)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:274)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:207)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:854)
at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:772)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672)
at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
and
java.io.IOException: Wrong type of referenced length object COSObject{7, 0}: COSDictionary
at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:907)
at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:949)
at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:780)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741)
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:672)
at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:632)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:217)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:966)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:870)
I read the migration guide and used load instead of loadNonSeq, because PDFBox now handles that internally.
Any suggestions on how to fix these errors?
EDIT#2: @TilmanHausherr I checked your theory. I opened the file in Sublime, removed the extra whitespace at the start, and saved it. I then got the following error:
org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56)
at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075)
at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348)
at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
at utils.PDFManager.PDFToText(PDFManager.java:280)
at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56)
at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:48)
at processing.Controller.getDocumentType(Controller.java:110)
at processing.Controller.insertIntoDb(Controller.java:43)
at Test.main(Test.java:203)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.util.zip.DataFormatException: invalid distance too far back
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64)
... 19 more
Mar 09, 2017 11:07:22 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:56)
at org.apache.pdfbox.pdfparser.COSParser.parseXrefStream(COSParser.java:2075)
at org.apache.pdfbox.pdfparser.COSParser.parseXrefObjStream(COSParser.java:348)
at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:303)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:194)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:252)
at utils.PDFManager.PDFToText(PDFManager.java:280)
at processing.charge.CertificateUtils.getCertificateTypeFromFile(CertificateUtils.java:56)
at processing.charge.CertificateUtils.getCertificateType(CertificateUtils.java:49)
at processing.Controller.getDocumentType(Controller.java:110)
at processing.Controller.insertIntoDb(Controller.java:43)
at Test.main(Test.java:203)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.util.zip.DataFormatException: invalid distance too far back
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:64)
Also, to verify your theory, I opened another file (one that was working correctly) in Sublime, and it had the same spaces, tabs and CRs.
Recommended answer
As discussed in the comments, the files have whitespace (CRs and TABs) before the PDF header starts. You can remove it with Notepad++ (or any editor that can edit binary files) or, if all your files have this flaw, by writing a short piece of code that opens an input stream, swallows bytes until it hits "%", and then copies everything from there on to an output stream.
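The "short piece of code" could look roughly like the sketch below. It uses only the Java standard library. The class name PdfHeaderTrimmer, the method name stripLeadingJunk and the command-line arguments are made up for illustration and are not part of PDFBox. It assumes the first "%" in the file is the one that starts the PDF header, and it works on raw bytes, so nothing is re-encoded the way a text editor might.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class PdfHeaderTrimmer {

    /**
     * Copies 'in' to 'out', dropping every byte that precedes the first '%'.
     * Assumption: the first '%' in the stream begins the PDF header.
     * Returns the number of bytes dropped.
     */
    static long stripLeadingJunk(InputStream in, OutputStream out) throws IOException {
        long skipped = 0;
        int b;
        // Swallow bytes until '%' or end of stream.
        while ((b = in.read()) != -1 && b != '%') {
            skipped++;
        }
        if (b == -1) {
            throw new IOException("No '%' found; input does not look like a PDF");
        }
        out.write(b); // keep the '%' itself

        // Copy everything after the '%' unchanged.
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PdfHeaderTrimmer <source.pdf> <target.pdf>");
            return;
        }
        Path source = Paths.get(args[0]);
        Path target = Paths.get(args[1]); // must not be the same file as source
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            long skipped = stripLeadingJunk(in, out);
            System.out.println("Dropped " + skipped + " leading byte(s)");
        }
    }
}
```

The cleaned copy can then be passed to PDDocument.load as before. Writing to a separate target file keeps the original untouched in case the assumption about the first "%" does not hold for a particular file.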
I also opened issue PDFBOX-3714.
Update: This has been fixed in 2.0.5, which is now available.