Extracting text with iTextSharp throws an InvalidCastException

Question

I'm currently extracting text from a PDF file with iTextSharp.

With dozens of PDFs it works fine; however, 2 of the PDFs throw an InvalidCastException (stack trace at [1]).

The code which throws this exception is the following (the exception is thrown in GetTextFromPage):

        PdfReader reader = new PdfReader(byteArray);
        PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());

Some additional notes:

  • The Preflight Syntax check in Adobe Acrobat doesn't find any errors.
  • A sample PDF which generates this error is located at: http://resources.mpi-inf.mpg.de/DisparityModel (the paper, Adobe Acrobat PDF, 6.69 MB).
  • I already tried LocationTextExtractionStrategy; same error.

Besides Preflight, how can I check whether the PDF file is corrupt? Or where does this error come from?

[1]

System.InvalidCastException was unhandled
  HResult=-2147467262
  Message=Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'.
  Source=itextsharp
  StackTrace:
       at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
       at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
       at iTextSharp.text.pdf.DocumentFont.Init()
       at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
       at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
       at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
       at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
       at ConsoleApplication1.Program.Main(String[] args) in e:\foobar\projects\AnalyzePDF\ConsoleApplication1\ConsoleApplication1\Program.cs:line 24
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: 

Recommended answer

The document in question contains a font with the following ToUnicode map:

/CIDInit /ProcSet findresource
begin
12 dict
begin
/CIDSystemInfo <</Ordering (UCS) /Registry (Adobe) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffffffffffffffff> endcodespacerange
20 beginbfchar
<0003> <0020> <0012> <0043> <0018> <0044> <0045> <004e> <0059> <0051> <005e> <0053> <0102> <0061> <0110> <0063> <011a> <0064> <011e> <0065> <015d> <0069> <0175> <006d> <0176> <006e> <017d> <006f> <01ffffff89> <0070> <01ffffff8c> <0072> <01ffffff90> <0073> <01ffffff9a> <0074> <01ffffffb5> <0075> <01ffffffc7> <0079> endbfchar
100 beginbfchar
<01ffffffcc> <007a> endcmap
CMapName
currentdict
/CMap defineresource
pop
end
end

The section where iText(Sharp) stumbles is:

100 beginbfchar
<01ffffffcc> <007a> endcmap

i.e. a section started by beginbfchar and ended by the non-matching endcmap.

I think a section started by beginbfchar always has to end in endbfchar.
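The mismatch described above can be detected mechanically once the text of a ToUnicode CMap is available. The following is only a minimal sketch in plain C#, not part of iTextSharp: the names CMapSectionChecker and FindProblems are hypothetical, and it assumes the decoded CMap stream has already been obtained as a string by some other means. It checks that every beginbfchar or beginbfrange is closed by the matching end operator and, for bfchar sections, that the declared count equals the number of code/value pairs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not an iTextSharp API.
public static class CMapSectionChecker
{
    // Returns human-readable problems; an empty list means no mismatch was found.
    public static List<string> FindProblems(string cmapText)
    {
        var problems = new List<string>();
        // Split on any whitespace; CMap operators and hex strings are whitespace-separated here.
        string[] tokens = cmapText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        string open = null;  // "bfchar" or "bfrange" while inside a section
        int declared = 0;    // the count written before the begin operator
        int hexTokens = 0;   // <...> tokens seen inside the current section

        for (int i = 0; i < tokens.Length; i++)
        {
            string t = tokens[i];
            if (t == "beginbfchar" || t == "beginbfrange")
            {
                if (open != null)
                    problems.Add("begin" + open + " is still open when " + t + " starts");
                open = t.Substring("begin".Length);
                declared = 0;
                if (i > 0) int.TryParse(tokens[i - 1], out declared);
                hexTokens = 0;
            }
            else if (open != null && t.StartsWith("<", StringComparison.Ordinal)
                                  && t.EndsWith(">", StringComparison.Ordinal))
            {
                hexTokens++;
            }
            else if (open != null && t.StartsWith("end", StringComparison.Ordinal))
            {
                if (t != "end" + open)
                    problems.Add("begin" + open + " is closed by " + t);
                else if (open == "bfchar" && hexTokens / 2 != declared)
                    problems.Add("bfchar section declares " + declared
                                 + " entries but contains " + (hexTokens / 2));
                open = null;
            }
        }

        if (open != null)
            problems.Add("begin" + open + " is never closed");
        return problems;
    }
}
```

Passing the section quoted above (100 beginbfchar ... endcmap) to FindProblems should report one problem, namely that beginbfchar is closed by endcmap. The count check is deliberately limited to bfchar sections, because bfrange entries may contain arrays and would need a real tokenizer.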

The font in question is a Calibri subset composite font. It is used in the form XObject referenced as Fm0 on the first page. That XObject has a dictionary entry

/PTEX.FileName (C:/MyFiles/Publications/DisparityMetric/Figures/Teaser.pdf)

so it probably has been copied from that Teaser.pdf file.
