iText7 PdfTextExtractor.GetTextFromPage: "'StandardEncoding' is not a supported encoding name."


Problem description


I have a method in our software that pulls the text from a PDF, whether the PDF is a scan or contains generated text.

I usually try the GetTextFromPage() method first. If it doesn't return text, then I move on to OCR'ing the page.

I have a particular 6-page PDF with the first three pages being a scanned document, and the last two being a form.

On this PDF I'm getting an error that I can't figure out how to resolve.

'StandardEncoding' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Parameter name: name

   at System.Globalization.EncodingTable.internalGetCodePageFromName(String name)
   at System.Globalization.EncodingTable.GetCodePageFromName(String name)
   at iText.IO.Util.IanaEncodings.GetEncodingEncoding(String name)
   at iText.IO.Util.EncodingUtil.ConvertToBytes(Char[] chars, String encoding)
   at iText.IO.Font.PdfEncodings.ConvertToBytes(String text, String encoding)
   at iText.IO.Font.FontEncoding.FillNamedEncoding()
   at iText.IO.Font.FontEncoding.CreateFontEncoding(String baseEncoding)
   at iText.Kernel.Font.PdfType1Font..ctor(PdfDictionary fontDictionary)
   at iText.Kernel.Font.PdfFontFactory.CreateFont(PdfDictionary fontDictionary)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.GetFont(PdfDictionary fontDict)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.SetTextFontOperator.Invoke(PdfCanvasProcessor processor, PdfLiteral operator, IList`1 operands)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.InvokeOperator(PdfLiteral operator, IList`1 operands)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
   at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
   at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page)
   at EFR.OCR.OCR.ExtractTextFromPDF(FileInfo fileInfo, Int32 StartingPage, Int32 NumberOfPages) in P:\Cloud\Dropbox\EF Recovery\OCRTest\EFR.OCR\OCR.vb:line 113

I've processed many PDFs through my code, some text, some scans, some mixed together. Some had forms... This is the first time that I've had this error.

Here's a snippet of my code...

    Using reader As New iText.Kernel.Pdf.PdfReader(fileInfo.FullName)
        reader.SetUnethicalReading(True)
        Using sourceDoc As New iText.Kernel.Pdf.PdfDocument(reader)
            If NumberOfPages = 0 Then NumberOfPages = sourceDoc.GetNumberOfPages
            For i As Integer = StartingPage To StartingPage + NumberOfPages - 1

                Dim pageText As String = ""
                Try
                    pageText = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(sourceDoc.GetPage(i))
                Catch ex As Exception
                    OCRLog.Log($"Error attempting to extract text from page {i}. {ex.ToString}")
                End Try

                If pageText = "" Then
                    'extract this page
                    Dim results As OCRResults = ExtractTextFromPDFImagePage(fileInfo.FullName, i)
                    pageText = results.Text
                    pageItems.Add(New OCRResults.PagesClass(results.Accuracy, True, pageText))
                Else
                    pageItems.Add(New OCRResults.PagesClass(100, False, pageText))
                End If

                stringBuilder.Append(pageText)
            Next

            Return New OCRResults(stringBuilder.ToString, pageItems)
        End Using
    End Using

Any ideas?

Solution

There is an error in the PDF, just as the error text indicates: "'StandardEncoding' is not a supported encoding name."

The fonts on the page you shared use the name StandardEncoding in their Encoding entries. This is not a valid name here. According to the specification ISO 32000-1 the only valid values here are MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding, see Table 111 – Entries in a Type 1 font dictionary – and Table 114 – Entries in an encoding dictionary.

Adobe Preflight also complains about these names when checking for syntax errors:

An unexpected value is associated with the key
  Key: BaseEncoding
  Value: /StandardEncoding
  Type: CosName
  Formal Representation: Encoding
  Cos ID: 38
  Traversal Path: ->Pages->Kids->[0]->Resources->Font->WARSP->Encoding
An unexpected value is associated with the key
  Key: Encoding
  Value: /StandardEncoding
  Type: CosName
  Formal Representation: Font.FontType1
  Cos ID: 27
  Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial,Bold
An unexpected value is associated with the key
  Key: BaseEncoding
  Value: /StandardEncoding
  Type: CosName
  Formal Representation: Encoding
  Cos ID: 22
  Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial->Encoding
An unexpected value is associated with the key
  Key: BaseEncoding
  Value: /StandardEncoding
  Type: CosName
  Formal Representation: Encoding
  Cos ID: 19
  Traversal Path: ->Pages->Kids->[0]->Resources->Font->ARROW->Encoding

(Excerpt from a preflight report for your shared PDF)


Although StandardEncoding is not a valid name here, the PDF specification does define a "Standard Encoding", see Annex D of ISO 32000-1. Most likely your document attempts to refer to that encoding at the locations outlined above.

If you need to extract text from the document in question, therefore, you may want to follow the recommendation of the error message:

For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.

The Encoding class here is the one in System.Text.

To extract the text from your PDF, therefore, it should suffice to implement an EncodingProvider that returns an Encoding instance for the name StandardEncoding. That Encoding should map byte codes to characters according to the STD column of the table in Annex D.2 – Latin Character Set and Encodings – of ISO 32000-1.
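
The EncodingProvider approach described above might be sketched roughly as follows. This is an untested outline, not a verified fix for your file. The class names PdfStandardEncoding and PdfStandardEncodingProvider are made up for illustration. The lookup table is deliberately incomplete: it only fills in the printable ASCII range plus the two quote characters where Standard Encoding differs from ASCII, and the remaining codes still have to be copied from the STD column of Annex D.2. It also assumes that iText resolves the name through System.Text.Encoding.GetEncoding(name), which is what the stack trace suggests, and that you target .NET Framework 4.6 or later, where Encoding.RegisterProvider is available.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

' Hypothetical single-byte Encoding for the PDF "Standard Encoding".
Public Class PdfStandardEncoding
    Inherits Encoding

    Private Shared ReadOnly ByteToChar As Char() = BuildTable()
    Private Shared ReadOnly CharToByte As Dictionary(Of Char, Byte) = BuildReverse()

    Private Shared Function BuildTable() As Char()
        Dim table(255) As Char
        For code As Integer = 0 To 255
            table(code) = ChrW(&HFFFD)    ' placeholder for codes not filled in yet
        Next
        For code As Integer = &H20 To &H7E
            table(code) = ChrW(code)      ' printable ASCII is shared with Standard Encoding
        Next
        table(&H27) = ChrW(&H2019)        ' quoteright instead of the ASCII apostrophe
        table(&H60) = ChrW(&H2018)        ' quoteleft instead of the ASCII grave accent
        ' INCOMPLETE: add the codes from &HA1 upwards from the STD column of Annex D.2.
        Return table
    End Function

    Private Shared Function BuildReverse() As Dictionary(Of Char, Byte)
        Dim map As New Dictionary(Of Char, Byte)
        For code As Integer = 0 To 255
            Dim c As Char = ByteToChar(code)
            If c <> ChrW(&HFFFD) Then map(c) = CByte(code)
        Next
        Return map
    End Function

    Public Overrides Function GetByteCount(chars As Char(), index As Integer, count As Integer) As Integer
        Return count                      ' one byte per character
    End Function

    Public Overrides Function GetBytes(chars As Char(), charIndex As Integer, charCount As Integer,
                                       bytes As Byte(), byteIndex As Integer) As Integer
        For i As Integer = 0 To charCount - 1
            Dim b As Byte
            ' Characters without a mapping are written as "?".
            If Not CharToByte.TryGetValue(chars(charIndex + i), b) Then b = CByte(&H3F)
            bytes(byteIndex + i) = b
        Next
        Return charCount
    End Function

    Public Overrides Function GetCharCount(bytes As Byte(), index As Integer, count As Integer) As Integer
        Return count                      ' one character per byte
    End Function

    Public Overrides Function GetChars(bytes As Byte(), byteIndex As Integer, byteCount As Integer,
                                       chars As Char(), charIndex As Integer) As Integer
        For i As Integer = 0 To byteCount - 1
            chars(charIndex + i) = ByteToChar(bytes(byteIndex + i))
        Next
        Return byteCount
    End Function

    Public Overrides Function GetMaxByteCount(charCount As Integer) As Integer
        Return charCount
    End Function

    Public Overrides Function GetMaxCharCount(byteCount As Integer) As Integer
        Return byteCount
    End Function
End Class

' Hypothetical provider that only answers for the name "StandardEncoding".
Public Class PdfStandardEncodingProvider
    Inherits EncodingProvider

    Public Overrides Function GetEncoding(name As String) As Encoding
        If String.Equals(name, "StandardEncoding", StringComparison.OrdinalIgnoreCase) Then
            Return New PdfStandardEncoding()
        End If
        Return Nothing                    ' let the framework handle every other name
    End Function

    Public Overrides Function GetEncoding(codepage As Integer) As Encoding
        Return Nothing                    ' no code page number is claimed here
    End Function
End Class
```

You would then register the provider once at startup, before opening any PDF, with Encoding.RegisterProvider(New PdfStandardEncodingProvider()). Returning Nothing for every other name leaves the normal lookup untouched, so documents that already work should not be affected.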
