Extracting Arabic text in C# using iTextSharp


Question

I have this code, which I use to extract the text of a PDF. It works well for a PDF in English, but when I try to extract Arabic text, it gives me output like this:


") + n 9 n <+, + )+ $ # $ +$ F% 9& .< $ : ;"



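// Needs: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Create a fresh strategy per page; a reused instance keeps the text of earlier pages.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
    }
}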


Recommended answer

I had to change the strategy like this:

var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);

This function reverses the Arabic words and leaves the English text unchanged:

// Needs: using System.Linq; using System.Text;

// Reverses each run of Arabic characters and copies everything else as is.
private string Convert(string source)
{
    string arabicWord = string.Empty;
    StringBuilder sbDestination = new StringBuilder();

    foreach (var ch in source)
    {
        if (IsArabic(ch))
        {
            arabicWord += ch;
        }
        else
        {
            // A non-Arabic character ends the current Arabic run: emit it reversed.
            if (arabicWord != string.Empty)
                sbDestination.Append(Reverse(arabicWord));

            sbDestination.Append(ch);
            arabicWord = string.Empty;
        }
    }

    // Emit the final run if the text ended with Arabic characters.
    if (arabicWord != string.Empty)
        sbDestination.Append(Reverse(arabicWord));

    return sbDestination.ToString();
}

private bool IsArabic(char character)
{
    // Arabic block
    if (character >= 0x600 && character <= 0x6ff)
        return true;

    // Arabic Supplement
    if (character >= 0x750 && character <= 0x77f)
        return true;

    // First part of Arabic Presentation Forms-A
    if (character >= 0xfb50 && character <= 0xfc3f)
        return true;

    // Arabic Presentation Forms-B
    if (character >= 0xfe70 && character <= 0xfefc)
        return true;

    return false;
}

// Reverses the characters of a string (Reverse() and ToArray() come from System.Linq).
private string Reverse(string source)
{
    return new string(source.ToCharArray().Reverse().ToArray());
}
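
The same idea can also be written more compactly with a regular expression. The following is only a sketch, not part of the original answer: the class name ArabicRunReverser and the method name ReverseArabicRuns are made up for illustration, and the character class simply repeats the four ranges checked by IsArabic above. It depends only on the .NET base class library, so it can be tried without a PDF.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name chosen for illustration only).
public static class ArabicRunReverser
{
    // Same four ranges that IsArabic checks in the answer above.
    private static readonly Regex ArabicRun = new Regex(
        "[\u0600-\u06FF\u0750-\u077F\uFB50-\uFC3F\uFE70-\uFEFC]+");

    // Reverses each maximal run of Arabic characters; all other text is left untouched.
    public static string ReverseArabicRuns(string source)
    {
        return ArabicRun.Replace(source, match =>
        {
            char[] chars = match.Value.ToCharArray();
            Array.Reverse(chars);
            return new string(chars);
        });
    }
}
```

Assuming the extracted page text is in a variable t as in the snippet above, ArabicRunReverser.ReverseArabicRuns(t) could stand in for Convert(t). Spaces are not in the Arabic ranges, so each word is reversed on its own and the order of the words stays as extracted.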
