Read PDF using iTextSharp where PDF language is non-English


Problem description

I am trying to read this PDF using iTextSharp in C# in order to convert it into a Word file. The conversion also needs to preserve table formatting and fonts in Word. With an English PDF it works perfectly, but with some Indian languages such as Hindi and Marathi it does not work.

    // Requires: System, System.IO, System.Text, System.Windows.Forms,
    // iTextSharp.text.pdf and iTextSharp.text.pdf.parser.
    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();
        try
        {
            // Check for the file before opening it, and open it only once.
            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);
                try
                {
                    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                    {
                        // Use a fresh strategy per page so text does not accumulate across pages.
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                        text.Append(currentText);
                    }
                }
                finally
                {
                    // Close the reader once, after all pages have been read.
                    pdfReader.Close();
                }
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
        textBox1.Text = text.ToString();
        return text.ToString();
    }

Solution

I inspected your file with a special focus on your sample "मतद|र" being extracted as "मतदरर" in the topmost line of the document pages.

In a nutshell:

Your document itself provides the information that e.g. the glyphs "मतद|र" in the headline represent the text "मतदरर". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

In detail:

The top line of the first page is generated by the following operations in the page content stream:

/9 280 Tf
(-12"!%$"234%56*5) Tj

The first line selects the font named /9 at a size of 280 (an operation at the beginning of the page scales everything by a factor of 0.05; thus, the effective size is 14 units which you observe in the file).

The second line causes glyphs to be printed. These glyphs are referenced between the parentheses using the custom encoding of that font.

When a program tries to extract the text, it has to deduce the actual characters from these glyph references using information from the font.

The font /9 on the first page of your PDF is defined using these objects:

242 0 obj<<
    /Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
    /Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
endobj
243 0 obj/CDAC-GISTSurekh-Bold+0
endobj 
247 0 obj<<
    /Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
    /Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
    /Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
endobj 

So there is no /Encoding element but at least there is a reference to a /ToUnicode map. Thus, a program extracting text has to rely on the given /ToUnicode mapping.
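To check which of these entries the fonts on a page actually carry, the page's font resources can be listed with iTextSharp's low-level object API. The following is a minimal sketch, assuming iTextSharp 5.x is referenced and that the fonts are declared directly in the resources of page 1 (not inherited from a parent node); the class name FontInspector and the command-line argument are illustrative only.

```csharp
using System;
using iTextSharp.text.pdf;

public static class FontInspector
{
    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the PDF to inspect.
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            PdfDictionary page = reader.GetPageN(1);
            PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null)
            {
                Console.WriteLine("No font resources found directly on page 1.");
                return;
            }

            foreach (PdfName name in fonts.Keys)
            {
                // GetAsDict resolves the indirect reference to the font object.
                PdfDictionary font = fonts.GetAsDict(name);
                if (font == null) continue;

                Console.WriteLine("{0}: Encoding={1}, ToUnicode={2}",
                    name,
                    font.Contains(PdfName.ENCODING),
                    font.Contains(PdfName.TOUNICODE));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

For the font discussed here, such a listing would be expected to report no /Encoding entry and a present /ToUnicode entry, matching the object dump above.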

The stream referenced by /ToUnicode contains the following mappings of interest when extracting the text from (-12"!%$"234%56*5):

<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>

(Already here you can see that multiple character codes are mapped to the same Unicode code point...)
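The collisions can also be found mechanically. The sketch below parses the twelve mapping lines quoted above and groups the character codes by their Unicode target. It uses only the .NET base library; the class and method names are made up for illustration, and it assumes every range covers a single code (first code equal to last code), as in the excerpt.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ToUnicodeCollisions
{
    // The mapping lines quoted above: <first code> <last code> <Unicode target>.
    public static readonly string[] Lines =
    {
        "<21> <21> <0930>", "<22> <22> <0930>", "<24> <24> <091c>",
        "<25> <25> <0020>", "<2a> <2a> <0031>", "<2d> <2d> <092e>",
        "<31> <31> <0924>", "<32> <32> <0926>", "<33> <33> <0926>",
        "<34> <34> <002c>", "<35> <35> <0032>", "<36> <36> <0030>"
    };

    // Parses one line into (character code, Unicode code point).
    // Assumption: first and last code are identical, so only the first is read.
    public static KeyValuePair<int, int> ParseLine(string line)
    {
        string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int code = int.Parse(parts[0].Trim('<', '>'), NumberStyles.HexNumber);
        int target = int.Parse(parts[2].Trim('<', '>'), NumberStyles.HexNumber);
        return new KeyValuePair<int, int>(code, target);
    }

    // Returns only those Unicode targets that more than one character code maps to.
    public static Dictionary<int, List<int>> FindCollisions(IEnumerable<string> lines)
    {
        return lines.Select(ParseLine)
            .GroupBy(pair => pair.Value, pair => pair.Key)
            .Where(group => group.Count() > 1)
            .ToDictionary(group => group.Key, group => group.ToList());
    }

    public static void Main()
    {
        foreach (KeyValuePair<int, List<int>> entry in FindCollisions(Lines))
        {
            string codes = string.Join(", ", entry.Value.Select(c => "0x" + c.ToString("x2")));
            Console.WriteLine("U+{0:X4} <- {1}", entry.Key, codes);
        }
    }
}
```

For this excerpt the shared targets are U+0930 (codes 0x21 and 0x22) and U+0926 (codes 0x32 and 0x33). Whenever two different glyphs share one target, no text extractor can tell them apart from the mapping alone.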

Thus, text extraction must result in:

- = 0x2d -> 0x092e = म
1 = 0x31 -> 0x0924 = त
2 = 0x32 -> 0x0926 = द
" = 0x22 -> 0x0930 = र    instead of  |
! = 0x21 -> 0x0930 = र
% = 0x25 -> 0x0020 =  
$ = 0x24 -> 0x091c = ज
" = 0x22 -> 0x0930 = र
2 = 0x32 -> 0x0926 = द
3 = 0x33 -> 0x0926 = द
4 = 0x34 -> 0x002c = ,
% = 0x25 -> 0x0020 =  
5 = 0x35 -> 0x0032 = 2
6 = 0x36 -> 0x0030 = 0
* = 0x2a -> 0x0031 = 1
5 = 0x35 -> 0x0032 = 2
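
The lookup above can be reproduced in a few lines. This is only an illustrative sketch using the .NET base library, not how iTextSharp implements extraction; the class ToUnicodeDecode and its members are hypothetical names, and the table contains just the twelve mappings quoted from the /ToUnicode stream.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ToUnicodeDecode
{
    // Character code (as it appears in the content stream string) -> Unicode code point,
    // copied from the /ToUnicode excerpt quoted above.
    public static readonly Dictionary<char, int> Map = new Dictionary<char, int>
    {
        { '!', 0x0930 }, { '"', 0x0930 }, { '$', 0x091C }, { '%', 0x0020 },
        { '*', 0x0031 }, { '-', 0x092E }, { '1', 0x0924 }, { '2', 0x0926 },
        { '3', 0x0926 }, { '4', 0x002C }, { '5', 0x0032 }, { '6', 0x0030 }
    };

    // Translates each character code through the map; unmapped codes
    // become U+FFFD (the Unicode replacement character).
    public static string Decode(string codes)
    {
        StringBuilder result = new StringBuilder();
        foreach (char code in codes)
        {
            int codePoint;
            if (Map.TryGetValue(code, out codePoint))
                result.Append(char.ConvertFromUtf32(codePoint));
            else
                result.Append('\uFFFD');
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // The operand of the Tj operation from the page content stream.
        Console.WriteLine(Decode("-12\"!%$\"234%56*5"));
    }
}
```

Decoding the Tj operand this way yields "मतदरर जरदद, 2012", i.e. exactly the text the mapping dictates, including the "रर" where the page visibly shows different glyphs.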

Thus, the text that iTextSharp (and also Adobe Reader!) extracts from the heading on the first document page is exactly what the document, in its font information, claims is correct.

As the cause for this is the misleading mapping information in the font definition, it is not surprising that there are misinterpretations all over the document.
