Reading PDF content using iTextSharp in C#


Question

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.
Here is a sample non-English PDF for testing.
The result is something like this:

َٛنا ÙÙ"ب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØ­Ù" قٛمح ÛŒÙ"بٕس © Karl Seguin foppersian.codeplex.com www.codebetter.com 1 1 ÙÙ"ب٘طث َٛنا یؿیٛ٘

همانرب لوصا یسیون  مرن دیلوت رتهب رازÙا

What is the solution?

        public string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);

                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                    text.Append(currentText);
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

Recommended answer

In .NET, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16, but that doesn't matter. Never, ever, ever decompose the string into bytes, reinterpret those bytes as a different encoding, and slap the result back into a string, because that doesn't make sense and will almost always fail.

Your problem is this line:

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

// The byte values below assume Encoding.Default is Windows-1252.
byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB 0x8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC3 0x9B 0xC5 0x92
string final = Encoding.UTF8.GetString(converted); // final now holds "ÛŒ" instead of "ی"

The code will garble any character above the 7-bit ASCII range (code points above 127). Drop the re-encoding line and you should be good.
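To see the effect in isolation, here is a minimal sketch of the same round trip. The class name `ReencodeDemo` is hypothetical, and ISO-8859-1 is used as a stand-in for the machine's legacy default code page (an assumption, chosen because it is available on every .NET runtime), so the exact garbage differs from the Windows-1252 values above.

```csharp
using System;
using System.Text;

public static class ReencodeDemo
{
    // Assumption: ISO-8859-1 stands in for a legacy single-byte default code page.
    private static readonly Encoding Legacy = Encoding.GetEncoding("iso-8859-1");

    // Reproduces the problematic round trip from the question.
    public static string RoundTrip(string s)
    {
        // Step 1: a correct string is turned into UTF-8 bytes.
        byte[] utf8 = Encoding.UTF8.GetBytes(s);

        // Step 2: those UTF-8 bytes are wrongly treated as legacy-encoded text
        // and "converted" to UTF-8, so every byte above 0x7F becomes its own character.
        byte[] converted = Encoding.Convert(Legacy, Encoding.UTF8, utf8);

        // Step 3: the result decodes cleanly, but it is no longer the original text.
        return Encoding.UTF8.GetString(converted);
    }
}
```

With this sketch, `ReencodeDemo.RoundTrip("Karl Seguin")` comes back unchanged because every character is plain ASCII, while `ReencodeDemo.RoundTrip("\u06CC")` comes back as the two characters U+00DB and U+008C.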

Side note: it is entirely possible that whatever creates a string does it incorrectly; that's actually not too uncommon. But you need to fix that problem before it becomes a string, at the byte level.
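As an illustration of working at the byte level, the following sketch decodes raw bytes with a strict UTF-8 decoder so that a wrong encoding assumption surfaces as a failure instead of silently producing a bad string. It only applies where you hold the raw bytes yourself; the helper name `ByteLevelDecoding.TryDecodeUtf8` is hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Text;

public static class ByteLevelDecoding
{
    // A UTF-8 decoder that throws on invalid byte sequences instead of
    // substituting replacement characters (no BOM, throwOnInvalidBytes = true).
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Returns true and the decoded text when raw is valid UTF-8; otherwise false.
    public static bool TryDecodeUtf8(byte[] raw, out string text)
    {
        try
        {
            text = StrictUtf8.GetString(raw);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not UTF-8; the caller should pick another encoding
            // here, while the data is still bytes, not after it became a string.
            text = string.Empty;
            return false;
        }
    }
}
```

For example, the bytes `0xDB 0x8C` decode to "ی", whereas the truncated sequence `0xDB` alone is rejected.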

Edit

The code should be exactly the same as yours above, except that the one re-encoding line should be removed. Also, whatever you're using to display the text, make sure that it supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.

    public string ReadPdfFile(string fileName) {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName)) {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
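One way to rule out the display side is to save the extracted text to a UTF-8 file and open it in a Unicode-aware editor. This is only a sketch; the class `ExtractedTextWriter` and its method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtractedTextWriter
{
    // Writes the text as UTF-8 so that no character is lost to a legacy code page.
    public static void Save(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
    }

    // Reads the file back as UTF-8, for checking that the round trip is lossless.
    public static string Load(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

If the saved file shows the right characters but your own UI does not, the extraction is fine and the problem lies in the display. For console output, setting `Console.OutputEncoding` to `Encoding.UTF8` may also be needed, and whether the glyphs render still depends on the console font.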

Edit 2

The above code fixes the encoding issue but doesn't fix the order of the characters within the strings themselves. Unfortunately, this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, "Organization and Use of Fonts") whose character codes are given in reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings as described above, the content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.
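If a PDF really does store its show strings in reverse order, one crude workaround is to reverse the extracted text again. The sketch below is a naive illustration, not something iTextSharp provides: the helper `VisualOrderHelper.ReverseTextElements` is hypothetical, it assumes the whole input is right-to-left text, and it assumes combining marks still follow their base characters. Mixed content such as Latin words, URLs, or numbers would come out reversed incorrectly.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class VisualOrderHelper
{
    // Reverses a string one text element at a time, so surrogate pairs and
    // base-plus-combining-mark sequences are kept together.
    public static string ReverseTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }

        elements.Reverse();
        return string.Concat(elements);
    }
}
```

For instance, the first word of the extracted sample, "همانرب", becomes "برنامه" when reversed this way. A robust solution would have to detect which runs are reversed, which this sketch does not attempt.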
