Strange characters reading PDF with iTextSharp


Problem description



I'm using iTextSharp to read a PDF file. I try to read the full text in the first page with this simple code:

var pdfReader = new PdfReader("<fileName>");
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new SimpleTextExtractionStrategy());

It returns a string like this:

"\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 \0 \0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 (\n\0 \0 \0 ) \0 \0 * \0 + , \0 , \0 \0 & , \0 - \0 . # \0 \0 \0 & $ \0 , \0 /\n+ \0 & & \0 * 0 \0 1 .\n2 \0 3\n4 - \0 5 \0 \0 $ \0 \0 # \0 \0 \0 & $ \0 , \0 * & \0 \0 ’ \0 .\n6\n\0 \0 \0 - \0 \0 \0 \0 & \0 \0 \0 \0 \0 \0 \0 , \0 # \0 \0 \0 & $ \0 , \0 \0 \0 & \0 # \0 \0 & $ ’ ) & \0 \0 \0 \0 # \0 ’ ’ \0 7 - \0 $ \0 \0 7 \0 ’ \0 , \0 8\n9 5 \0 \0 , \0 \0 $ $ \0 \0 \0 \0 \0 ’ \0 \0 3\n\0 \0 \0 ) \0 \0 \0 \0 4 - \0 5 \0 \0 $ \0 \0 * & \0 \0 ’ \0 .\n\0 \0 \0 \0 # \0 $ \0 $ \0 \0 ) \0 \0 \0 : 0 ; \0 ; < ; : 1 ; + \0 = < 9 = < < > \0 ? \0 ? \0 3 \0 (\n@\n\0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 ! 3\n\0 ......"

I can read the original PDF with Acrobat Reader and browsers. The file seems to be a PDF/A.

The code I use works with other PDFs.

Does iText have a problem with this standard?

Can someone point me in the right direction?

Update

Copy/paste from Acrobat gives me broken text. I don't think it's an iTextSharp (5.5.10) problem.

Update

You can try with this file: PDF Example

Solution

The file does not contain information required for text extraction. Furthermore, the file is invalid as a PDF/A file.

Information for text extraction

The sample file contains a background (located in a form XObject resource) showing the empty form and a foreground (immediately in the page content stream) of filled-in values.

The text in the form XObject is drawn using a Type 3 font without a standard encoding or standard names in its encoding. There also is no ToUnicode map in it.

This means that the text drawing instructions in that form XObject take sequences of bytes as arguments, and for each byte value the Type 3 font object provides a stream containing simple drawing instructions (path definitions using lines and curves; path filling instructions), but there is no information about which Unicode value corresponds to that byte value or set of drawing instructions.

Thus, PDF viewers can draw the page but they cannot correctly put a Unicode string of characters into the clipboard which we as humans would read from that drawing, and neither can iTextSharp.

Short of OCR there is no reasonable way to extract text from the form.
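
To check whether a given file has this problem, one can list the fonts referenced by the page and by its form XObjects and look at their subtype, encoding and ToUnicode entries. The following is only a minimal sketch, assuming iTextSharp 5.5.x; the class name `FontReport` and the helper `DumpFonts` are made up for illustration, and nested form XObjects are not followed.

```csharp
using System;
using iTextSharp.text.pdf;

static class FontReport
{
    // Hypothetical helper: prints subtype, encoding and ToUnicode presence
    // for every font listed in the given resources dictionary.
    static void DumpFonts(PdfDictionary resources, string label)
    {
        PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
        if (fonts == null) return;

        foreach (PdfName name in fonts.Keys)
        {
            PdfDictionary font = fonts.GetAsDict(name);
            if (font == null) continue;

            Console.WriteLine("{0} {1}: subtype={2}, encoding={3}, hasToUnicode={4}",
                label, name,
                font.GetAsName(PdfName.SUBTYPE),          // e.g. /Type3 or /TrueType
                font.GetDirectObject(PdfName.ENCODING),   // a name, a dictionary, or nothing
                font.Contains(PdfName.TOUNICODE));
        }
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the PDF to inspect.
        var reader = new PdfReader(args[0]);
        PdfDictionary resources = reader.GetPageN(1).GetAsDict(PdfName.RESOURCES);

        // Fonts used directly by the page content stream.
        DumpFonts(resources, "page");

        // Fonts used by form XObjects referenced from the page.
        PdfDictionary xobjects = resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects != null)
        {
            foreach (PdfName name in xobjects.Keys)
            {
                PdfStream xobject = xobjects.GetAsStream(name);
                if (xobject != null && PdfName.FORM.Equals(xobject.GetAsName(PdfName.SUBTYPE)))
                    DumpFonts(xobject.GetAsDict(PdfName.RESOURCES), "form " + name);
            }
        }

        reader.Close();
    }
}
```

A Type 3 font reported with `hasToUnicode=False` and a custom encoding is a strong hint that text drawn with it cannot be mapped back to Unicode.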


The text immediately in the foreground, on the other hand, is drawn using a font with a standard encoding (WinAnsiEncoding) and, therefore, can be extracted. Thus, at the end of the output of the OP's code you'll find

\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000

 ...

\u0000 \u0000 \u0000 x s \u0000 l t n q o x m l \u0000 z \u0000 ~ { \u0000 } } \u0000 l w x
2016
14874587948 DITTA PROVA SRL
CREMA CR 26013 VIA DANTE 17
011110
LPRGCM82T26D150H LEOPARDI GIACOMO
M 26 12 1982 CREMONA CR
MILANO MI F205
28 02 2017
DITTAP0101 / LEOGIA01001

i.e. the filled-in values of the form.
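
If only those filled-in values are of interest, the unusable part of the extracted string can be cut off after extraction. The sketch below is a rough heuristic, not part of iTextSharp: it assumes that, as in this sample, the readable foreground text follows the last line containing a NUL character. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;

static class ExtractedTextFilter
{
    // Hypothetical helper: returns the lines that follow the last line
    // containing a NUL character ('\0'). If no line contains NUL,
    // the text is returned unchanged.
    public static string TextAfterLastNulLine(string pageText)
    {
        string[] lines = pageText.Split('\n');

        // Index of the last line that still contains unmapped glyphs; -1 if none.
        int lastNul = Array.FindLastIndex(lines, line => line.IndexOf('\0') >= 0);

        return string.Join("\n", lines.Skip(lastNul + 1));
    }
}
```

It would be applied to the result of `PdfTextExtractor.GetTextFromPage(...)`. For files where readable and unreadable text are interleaved, this heuristic would discard readable lines, so it should be adapted to the document at hand.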

PDF/A conformance

The file indeed claims to be PDF/A-1a but inspecting it one quickly sees that this is a blatant lie. E.g. Adobe Acrobat Preflight says:

These entries indicate that the document does not even try to conform to PDF/A-1a; it merely claims to.
