Reading PDF content using iTextSharp in C#


Problem description

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.
The result is something like this:
Here is a sample non-English PDF for testing.

َٛنا ÙÙ"ب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØ­Ù" قٛمح ÛŒÙ"بٕس © Karl Seguin foppersian.codeplex.com www.codebetter.com 1 1 ÙÙ"ب٘طث َٛنا یؿیٛ٘

همانرب لوصا یسیون  مرن دیلوت رتهب رازÙا

What is the solution?

  public string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);

                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                    text.Append(currentText);
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

Solution

In .NET, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16, but that doesn't matter. Never, ever, ever decompose the string into bytes, try to reinterpret those bytes as a different encoding, and slap the result back into a string, because that doesn't make sense and will almost always fail.

Your problem is this line:

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC39BC592 (when Encoding.Default is Windows-1252)
string final = Encoding.UTF8.GetString(converted); // final now holds ÛŒ

The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
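To see the effect in isolation, here is a minimal sketch of the same round trip. It assumes ISO-8859-1 as a stand-in for `Encoding.Default` so the result doesn't depend on the machine's code page. With Windows-1252 the second garbled character would be Œ rather than U+008C. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Reproduces the faulty round trip: the UTF-8 bytes of the string are
    // misread as a single-byte encoding and then re-encoded as UTF-8.
    // ISO-8859-1 is an assumption standing in for Encoding.Default.
    public static string Mangle(string input)
    {
        Encoding singleByte = Encoding.GetEncoding("iso-8859-1");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] converted = Encoding.Convert(singleByte, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        string original = "\u06CC"; // ARABIC LETTER FARSI YEH, UTF-8 bytes DB 8C

        string mangled = Mangle(original);

        Console.WriteLine(original.Length);         // 1
        Console.WriteLine(mangled.Length);          // 2 (U+00DB followed by U+008C)
        Console.WriteLine(mangled == original);     // False
        Console.WriteLine(Mangle("abc") == "abc");  // True: ASCII survives unchanged
    }
}
```

Pure ASCII input comes through untouched, which is why the original code appeared to work for English PDFs.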

Side note: it is entirely possible that whatever creates a string does so incorrectly; that's actually not too uncommon. But you need to fix that problem before the data becomes a string, at the byte level.
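As a rough illustration of fixing things "at the byte level", the sketch below decodes the same two raw bytes once with the encoding they were actually written in and once with the wrong one. The helper name `Decode` is hypothetical, and ISO-8859-1 is only an example of a wrong guess.

```csharp
using System;
using System.Text;

public static class DecodeAtByteLevel
{
    // Hypothetical helper: turn raw bytes into a string using the encoding
    // the bytes were really written in. This is the only point where the
    // choice of encoding matters.
    public static string Decode(byte[] raw, Encoding actualEncoding)
    {
        return actualEncoding.GetString(raw);
    }

    public static void Main()
    {
        byte[] raw = { 0xDB, 0x8C }; // UTF-8 bytes of U+06CC

        string right = Decode(raw, Encoding.UTF8);
        string wrong = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == "\u06CC"); // True
        Console.WriteLine(wrong == "\u06CC"); // False: wrong encoding chosen while decoding
    }
}
```

Once `wrong` exists as a string, the safe fix is to go back to `raw` and decode again, not to re-encode the string.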

EDIT

The code should be exactly the same as yours above, except that the re-encoding line is removed (the version below also moves pdfReader.Close() outside the loop so the reader isn't closed after the first page). Also, make sure that whatever you use to display the text supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.

    public string ReadPdfFile(string fileName) {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName)) {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
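On the point about displaying the text somewhere that supports Unicode, one option is to write the extracted string to a UTF-8 file and open it in a Unicode-aware editor. The following is only a sketch: `SaveAsUtf8`, the file name, and the sample string are assumptions, and the sample stands in for the value returned by `ReadPdfFile`.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtractedTextOutput
{
    // Hypothetical helper: writes the extracted text as UTF-8 (without a BOM)
    // so that Persian or Arabic characters are preserved on disk.
    public static void SaveAsUtf8(string text, string path)
    {
        File.WriteAllText(path, text, new UTF8Encoding(false));
    }

    public static void Main()
    {
        // Stand-in for the string returned by ReadPdfFile(fileName).
        string sample = "\u0628\u0631\u0646\u0627\u0645\u0647";

        string path = Path.Combine(Path.GetTempPath(), "extracted.txt");
        SaveAsUtf8(sample, path);

        // Reading the file back as UTF-8 yields the same string.
        Console.WriteLine(File.ReadAllText(path, Encoding.UTF8) == sample); // True
    }
}
```

If the text is printed to a console instead, the console's output encoding and font also need to support the script, which varies by environment.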

EDIT 2

The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, "Organization and Use of Fonts") whose character codes are given in reverse order.

PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings as described above, the content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I absolutely could be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.
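If the extracted text really does come out in reverse order, a naive workaround is to reverse each extracted line yourself. This is a sketch under that assumption, not something iTextSharp or the PDF spec prescribes. `ReverseTextElements` is a made-up name, and the approach will also reverse any embedded left-to-right runs (Latin words, digits), so treat it as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ReverseOrderText
{
    // Reverses a line by text element (grapheme) rather than by char, so
    // combining marks stay attached to their base character.
    public static string ReverseTextElements(string line)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(line);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        // Lam, Waw, Sad, Alef: the second word of the reversed sample line in the question.
        string reversed = "\u0644\u0648\u0635\u0627";
        string logical = ReverseTextElements(reversed);

        Console.WriteLine(logical == "\u0627\u0635\u0648\u0644"); // True
    }
}
```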
