iTextSharp 5.5.13.1: "No data is available for encoding 10000" when extracting text from a PDF


Problem description

I'm trying to extract text from multipage PDF documents. Almost all documents extract fine, but a couple of them blow up with the encoding 10000 error. The only thing that sets the failing pages apart is that they contain a button and form fields.

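            {
                var pageNumbersToSave = new List<int>();
                for (var i = 1; i <= r.NumberOfPages; i++)
                {
                    try
                    {
                        // This is the call that throws on the affected pages (see the stack trace below)
                        var s = PdfTextExtractor.GetTextFromPage( r, i, new SimpleTextExtractionStrategy() );
                        // ... (the snippet is cut off here in the original post)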

I also tried using a PdfStamper to flatten the form elements, but that didn't change anything:

            byte[] flatBytes;
            using ( var r = new PdfReader( pdfBytes ) )
            {
                using (var ms = new MemoryStream())
                {
                    using (var flattener = new PdfStamper(r, ms))
                    {
                        for ( var i = 1; i <= r.NumberOfPages; i++ )
                        {
                            r.AcroFields.RemoveFieldsFromPage( i );
                        }
                        flattener.FormFlattening = true;
                        flattener.Close();
                    }
                    flatBytes = ms.ToArray();
                }
            }

Obviously, when I was testing with the stamper, the top code used flatBytes rather than pdfBytes.

Full exception message: No data is available for encoding 10000. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.

Stack Trace:
   at System.Text.Encoding.GetEncoding(Int32 codepage)
   at System.Text.Encoding.GetEncoding(Int32 codepage, EncoderFallback encoderFallback, DecoderFallback decoderFallback)
   at iTextSharp.text.xml.simpleparser.IanaEncodings.GetEncodingEncoding(String name)
   at iTextSharp.text.pdf.PdfEncodings.ConvertToString(Byte[] bytes, String encoding)
   at iTextSharp.text.pdf.DocumentFont.FillEncoding(PdfName encoding)
   at iTextSharp.text.pdf.DocumentFont.DoType1TT()
   at iTextSharp.text.pdf.DocumentFont.Init()
   at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
   at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi, ICollection markedContentInfoStack)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at VerataParsers.ECWPdfExtractor.StripImagesFromPdf(Byte[] pdfBytes, Int32& adjustedPageCount) in C:\Users\Dell T5610\source\repos\SecureDirectMessaging\VerataParsers\ECWPdfExtractor.cs:line 85

Recommended answer

Fixed by adding the System.Text.Encoding.CodePages NuGet package and then registering its provider as follows:

            // Requires: using System.Text;
            // Run this before the first text-extraction call.
            var codePages = CodePagesEncodingProvider.Instance;
            Encoding.RegisterProvider(codePages);
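
For context, the stack trace shows iTextSharp calling Encoding.GetEncoding(10000), and code page 10000 is the "macintosh" (Mac Roman) encoding. .NET Core and later runtimes only expose it after the code-pages provider has been registered. The sketch below is one possible way to wrap the registration so it can be called at application startup. The class and method names (EncodingSetup, RegisterCodePages, GetMacRoman) are hypothetical and not part of iTextSharp or the original answer.

```csharp
using System.Text;

// Hypothetical helper, not part of iTextSharp or the original post.
public static class EncodingSetup
{
    // Makes legacy code pages (including 10000) available to Encoding.GetEncoding.
    public static void RegisterCodePages()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Resolves the same code page that the stack trace shows iTextSharp requesting.
    // Assumes RegisterCodePages() has been called on runtimes that need it.
    public static Encoding GetMacRoman()
    {
        return Encoding.GetEncoding(10000);
    }
}
```

Calling EncodingSetup.RegisterCodePages() once, before the first PdfTextExtractor.GetTextFromPage call, should be enough, since the provider registration applies to the whole process.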
