Read PDF using iTextSharp where the PDF language is non-English


Question

I am trying to read this PDF using iTextSharp in C# and convert it into a Word file. The conversion also needs to preserve table formatting and fonts in Word. With an English PDF it works perfectly, but with some Indian languages such as Hindi and Marathi it does not work.

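using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Member of the Windows Forms class that owns textBox1.
public string ReadPdfFile(string filename)
{
    StringBuilder text = new StringBuilder();
    try
    {
        // Check for the file before opening it, and open it only once.
        if (File.Exists(filename))
        {
            PdfReader pdfReader = new PdfReader(filename);
            try
            {
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    // Use a fresh strategy per page so text does not accumulate across pages.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    text.Append(currentText);
                }
            }
            finally
            {
                // Close the reader after all pages are read, not inside the loop.
                pdfReader.Close();
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    textBox1.Text = text.ToString();
    return text.ToString();
}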

Recommended answer

I inspected your file with a special focus on your sample "मतद|र", which is extracted as "मतदरर" in the topmost line of the document pages.


In short:

Your document itself states that, for example, the glyphs "मतद|र" in the headline represent the text "मतदरर". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

In detail:

The top line of the first page is generated by the following operations in the page content stream:

/9 280 Tf
(-12"!%$"234%56*5) Tj

The first line selects the font named /9 at a size of 280. An operation at the beginning of the page scales everything by a factor of 0.05, so the effective size is 280 × 0.05 = 14 units, which is what you observe in the file.

The second line causes glyphs to be printed. These glyphs are referenced between the parentheses using the custom encoding of that font.

When a program tries to extract the text, it has to deduce the actual characters from these glyph references using information from the font.

The font /9 on the first page of your PDF is defined using these objects:

242 0 obj<<
    /Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
    /Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
endobj
243 0 obj/CDAC-GISTSurekh-Bold+0
endobj 
247 0 obj<<
    /Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
    /Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
    /Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
endobj 

So there is no /Encoding element, but at least there is a reference to a /ToUnicode map. Thus, a program extracting text has to rely on the given /ToUnicode mapping.

The stream referenced by /ToUnicode contains the following mappings, which are the ones that matter when extracting the text from (-12"!%$"234%56*5):

<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>

(Already here you can see that multiple character codes are mapped to the same Unicode code point...)
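The collisions can be checked mechanically. The following is a minimal sketch, not part of the original answer and not an iTextSharp API: it parses the entries quoted above and groups character codes by their target code point. The names ToUnicodeCollisions, Parse and FindCollisions are hypothetical, and the parser assumes every target is a single UTF-16 code unit, which holds for these entries.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ToUnicodeCollisions
{
    // The mapping entries quoted in the answer, copied verbatim.
    public const string Entries = @"<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>";

    // Parses lines of the form "<lo> <hi> <dst>" into a code -> code point map.
    // Within a range, code lo+i maps to dst+i.
    public static Dictionary<int, int> Parse(string entries)
    {
        var map = new Dictionary<int, int>();
        var lines = entries.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            var parts = line.Split(new[] { ' ', '<', '>' }, StringSplitOptions.RemoveEmptyEntries);
            int lo = int.Parse(parts[0], NumberStyles.HexNumber);
            int hi = int.Parse(parts[1], NumberStyles.HexNumber);
            int dst = int.Parse(parts[2], NumberStyles.HexNumber);
            for (int code = lo; code <= hi; code++)
            {
                map[code] = dst + (code - lo);
            }
        }
        return map;
    }

    // Returns every code point that is the target of more than one character code,
    // together with the sorted list of those codes.
    public static Dictionary<int, List<int>> FindCollisions(Dictionary<int, int> map)
    {
        return map.GroupBy(kv => kv.Value, kv => kv.Key)
                  .Where(g => g.Count() > 1)
                  .ToDictionary(g => g.Key, g => g.OrderBy(c => c).ToList());
    }

    public static void Main()
    {
        foreach (var kv in FindCollisions(Parse(Entries)))
        {
            var codes = string.Join(", ", kv.Value.Select(c => "0x" + c.ToString("X2")));
            Console.WriteLine("U+{0:X4} <- {1}", kv.Key, codes);
        }
    }
}
```

For the entries above, this reports two collisions: codes 0x21 and 0x22 both map to U+0930 (र), and codes 0x32 and 0x33 both map to U+0926 (द). Once two different glyphs share one code point, no extractor can tell them apart from the font information alone.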

Thus, text extraction must result in:

- = 0x2d -> 0x092e = म
1 = 0x31 -> 0x0924 = त
2 = 0x32 -> 0x0926 = द
" = 0x22 -> 0x0930 = र    instead of  |
! = 0x21 -> 0x0930 = र
% = 0x25 -> 0x0020 =  
$ = 0x24 -> 0x091c = ज
" = 0x22 -> 0x0930 = र
2 = 0x32 -> 0x0926 = द
3 = 0x33 -> 0x0926 = द
4 = 0x34 -> 0x002c = ,
% = 0x25 -> 0x0020 =  
5 = 0x35 -> 0x0032 = 2
6 = 0x36 -> 0x0030 = 0
* = 0x2a -> 0x0031 = 1
5 = 0x35 -> 0x0032 = 2
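The table above can be reproduced with a few lines of code. This is only an illustrative sketch of what any extractor has to do with this font, not how iTextSharp implements it internally. The class name HeadlineDecoder and the method Decode are hypothetical, and the map is hard-coded from the entries quoted in this answer rather than read from the PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HeadlineDecoder
{
    // Character code -> Unicode code point, hard-coded from the /ToUnicode
    // entries quoted in the answer (assumption: these are the only codes needed).
    static readonly Dictionary<int, int> ToUnicode = new Dictionary<int, int>
    {
        { 0x21, 0x0930 }, { 0x22, 0x0930 }, { 0x24, 0x091C }, { 0x25, 0x0020 },
        { 0x2A, 0x0031 }, { 0x2D, 0x092E }, { 0x31, 0x0924 }, { 0x32, 0x0926 },
        { 0x33, 0x0926 }, { 0x34, 0x002C }, { 0x35, 0x0032 }, { 0x36, 0x0030 },
    };

    // Maps each single-byte character code of a Tj operand to its /ToUnicode target.
    // Codes without an entry become U+FFFD, a choice made only for this sketch.
    public static string Decode(string operand)
    {
        var sb = new StringBuilder();
        foreach (char c in operand)
        {
            int codePoint;
            sb.Append(ToUnicode.TryGetValue(c, out codePoint) ? (char)codePoint : '\uFFFD');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // The operand of the Tj operation from the page content stream.
        Console.WriteLine(Decode("-12\"!%$\"234%56*5"));
    }
}
```

Decoding the operand this way yields "मतदरर जरदद, 2012", the same string the table derives by hand. The code " (0x22) comes out as र because the map says so, regardless of which glyph the font actually draws for it.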

Thus, the text that iTextSharp (and also Adobe Reader!) extracts from the heading on the first document page is exactly what the document, in its font information, claims is correct.

As the cause is the misleading mapping information in the font definition, it is not surprising that there are misinterpretations all over the document.
