Extracting text with iTextSharp throws an InvalidCastException
Problem description
I'm currently extracting text from a PDF file with iTextSharp.
It works fine with dozens of PDFs; however, two of them throw an invalid cast exception (stack trace at [1]).
The code that throws this exception is the following (the exception is thrown in GetTextFromPage):
PdfReader reader = new PdfReader(byteArray);
PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
Some additional notes:
- The Preflight syntax check in Adobe Acrobat doesn't find any errors.
- A sample PDF that triggers this error is available at: http://resources.mpi-inf.mpg.de/DisparityModel (the paper, Adobe Acrobat PDF, 6.69 MB).
- I have already tried LocationTextExtractionStrategy, with the same error.
How can I check whether the PDF file is corrupt, other than with Preflight? Or where does this error come from?
[1]
System.InvalidCastException was unhandled
HResult=-2147467262
Message=Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'.
Source=itextsharp
StackTrace:
at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
at iTextSharp.text.pdf.DocumentFont.Init()
at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
at ConsoleApplication1.Program.Main(String[] args) in e:\foobar\projects\AnalyzePDF\ConsoleApplication1\ConsoleApplication1\Program.cs:line 24
at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
InnerException:
Recommended answer
The document in question contains a font with the following ToUnicode map:
/CIDInit /ProcSet findresource
begin
12 dict
begin
/CIDSystemInfo <</Ordering (UCS) /Registry (Adobe) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffffffffffffffff> endcodespacerange
20 beginbfchar
<0003> <0020> <0012> <0043> <0018> <0044> <0045> <004e> <0059> <0051> <005e> <0053> <0102> <0061> <0110> <0063> <011a> <0064> <011e> <0065> <015d> <0069> <0175> <006d> <0176> <006e> <017d> <006f> <01ffffff89> <0070> <01ffffff8c> <0072> <01ffffff90> <0073> <01ffffff9a> <0074> <01ffffffb5> <0075> <01ffffffc7> <0079> endbfchar
100 beginbfchar
<01ffffffcc> <007a> endcmap
CMapName
currentdict
/CMap defineresource
pop
end
end
The section where iText(Sharp) stumbles is:
100 beginbfchar
<01ffffffcc> <007a> endcmap
i.e. a section started by beginbfchar and ended by the non-matching endcmap.
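I think a section started by beginbfchar always has to be closed by endbfchar. Here the section also announces 100 entries but contains only one, so a parser that trusts the count presumably reads the endcmap operator where it expects another hex string, which would match the failed cast from PdfLiteral to PdfString in FillMetrics.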
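The pairing rule described above can be checked mechanically once the ToUnicode stream has been decoded to text. The following is only a minimal sketch under assumptions: the class and method names (CMapSectionCheck, FindProblems) are made up for illustration, the tokenizer is a simple regular expression rather than iText's parser, and only beginbfchar sections are inspected (beginbfrange is ignored).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class CMapSectionCheck
{
    // Returns one message per problem found in beginbfchar sections:
    // a section not closed by endbfchar, or a declared entry count that
    // differs from the number of <src> <dst> pairs actually present.
    public static List<string> FindProblems(string cmapText)
    {
        // Split into hex strings such as <0041> and bare words/numbers.
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(cmapText, @"<[0-9A-Fa-f\s]*>|[^\s<>\[\]]+"))
            tokens.Add(m.Value);

        var problems = new List<string>();
        for (int i = 0; i < tokens.Count; i++)
        {
            if (tokens[i] != "beginbfchar") continue;

            // The token before beginbfchar is the declared number of entries.
            int declared = -1;
            bool hasCount = i > 0 && int.TryParse(tokens[i - 1], out declared);

            // Count the hex strings that follow until the first non-hex token.
            int hexCount = 0;
            int j = i + 1;
            while (j < tokens.Count && tokens[j].StartsWith("<"))
            {
                hexCount++;
                j++;
            }

            string terminator = j < tokens.Count ? tokens[j] : "(end of data)";
            if (terminator != "endbfchar")
                problems.Add("beginbfchar section closed by '" + terminator + "' instead of endbfchar");
            if (hasCount && declared * 2 != hexCount)
                problems.Add("beginbfchar declares " + declared + " entries but " + hexCount + " hex strings follow");

            i = j - 1; // resume scanning at the terminator token
        }
        return problems;
    }
}
```

Fed the two-line fragment quoted above, this sketch reports both the wrong terminator and the count mismatch. A well-formed section such as the preceding 20-entry one yields no messages.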
The font in question is a Calibri subset composite font. It is used in the form XObject referenced as Fm0 on the first page. That XObject has a dictionary entry
/PTEX.FileName (C:/MyFiles/Publications/DisparityMetric/Figures/Teaser.pdf)
so it has probably been copied from that Teaser.pdf file.
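Since Preflight does not flag such a font, one pragmatic way to find the affected pages in a batch is to attempt extraction page by page and record where it throws. The following is a rough sketch, not a fix for the malformed CMap: PageProbe and FindFailingPages are hypothetical names, and the iTextSharp wiring appears only as a comment that reuses the calls from the question plus PdfReader.NumberOfPages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class PageProbe
{
    // Calls extractPage for pages 1..pageCount and records every page whose
    // extraction throws, mapped to the exception type and message.
    public static Dictionary<int, string> FindFailingPages(int pageCount, Func<int, string> extractPage)
    {
        var failures = new Dictionary<int, string>();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                extractPage(page);
            }
            catch (Exception ex)
            {
                // Deliberately broad: the goal is triage, not recovery.
                failures[page] = ex.GetType().Name + ": " + ex.Message;
            }
        }
        return failures;
    }
}

// Assumed wiring with iTextSharp (shown as a comment, not compiled here):
//   PdfReader reader = new PdfReader(byteArray);
//   var failures = PageProbe.FindFailingPages(
//       reader.NumberOfPages,
//       p => PdfTextExtractor.GetTextFromPage(reader, p, new SimpleTextExtractionStrategy()));
```

For the sample PDF this should list at least page 1, because the font is used by the form XObject on the first page.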