Fast transliteration for Arabic Text with Python


Question

I work with Arabic text files all the time, and to avoid encoding problems I transliterate the Arabic characters into ASCII characters according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).

Here is my code for doing so, but it is very slow, even on small files of around 400 KB. Any ideas for making it faster?

Thanks

    def transliterate(file):
        # Read the whole file (assumption: the text is UTF-8 encoded)
        data = open(file, encoding="utf-8").read()
        # Buckwalter symbol -> Arabic character
        buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
        # The slow part: the full set of replacements is repeated once per character of the file
        for char in data:
            for k, v in buckArab.items():
                # Replace each Arabic character (v) with its Buckwalter symbol (k)
                data = data.replace(v, k)
        return data

Recommended answer

Incidentally, someone has already written a script that does this, so you might want to check it out before spending too much time on your own: buckwalter2unicode.py

It probably does more than you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and I use that on my site.

The script above is what I had been using, but I just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, which seems to be simpler and faster (it references a dictionary named buck2uni that maps Buckwalter symbols to Arabic characters):

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
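
As a possible alternative to calling replace once per dictionary entry, the sketch below uses Python 3's built-in str.maketrans and str.translate, which map every character in a single pass over the text. This is not the answerer's code: the small buck2uni excerpt and the names trans_string_fast, BUCK_TO_ARABIC and ARABIC_TO_BUCK are made up for illustration, and the real buck2uni dictionary is assumed to cover the full Buckwalter scheme.

```python
# Hypothetical excerpt of a Buckwalter -> Arabic mapping; the real buck2uni
# dictionary referenced above is assumed to cover the whole scheme.
buck2uni = {
    "'": "\u0621",  # hamza
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "t": "\u062A",  # ta
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "a": "\u064E",  # fatha
    "i": "\u0650",  # kasra
}

# Build both translation tables once, up front, instead of on every call.
BUCK_TO_ARABIC = str.maketrans(buck2uni)
ARABIC_TO_BUCK = str.maketrans(
    {arabic: buck for buck, arabic in buck2uni.items()}
)


def trans_string_fast(text, reverse=False):
    """Transliterate Arabic into Buckwalter in one pass over the text.

    Set reverse=True to go from Buckwalter back to Arabic. Characters
    that are not in the mapping are left unchanged.
    """
    table = BUCK_TO_ARABIC if reverse else ARABIC_TO_BUCK
    return text.translate(table)
```

With this excerpt, trans_string_fast("\u0643\u062A\u0627\u0628") gives "ktAb", and passing reverse=True turns "ktAb" back into the Arabic string. Whether this actually beats the replace loop on a given corpus is worth measuring rather than assuming.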
