Fast transliteration for Arabic Text with Python

Question

I regularly work with Arabic text files, and to avoid encoding problems I transliterate the Arabic characters into Latin characters according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).

Here is my code to do so, but it is very slow, even with small files of around 400 KB. Any ideas for making it faster?

Thanks

    def transliterate(file):
        # Read the whole file (assumes it is UTF-8 encoded)
        with open(file, encoding="utf-8") as f:
            data = f.read()
        # Buckwalter symbol -> Arabic character (">" for أ, per the scheme linked above)
        buckArab = {"'":"ء", "|":"آ", ">":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
        for char in data:  # repeats the full set of replacements once per character
            for k, v in buckArab.items():
                data = data.replace(v, k)  # Arabic character -> Buckwalter symbol
        return data

Recommended answer

Edit Oct 2021

A Python package was recently released that does this (and a lot more), so anyone reading this post now should ignore all the other answers and just use Camel Tools. (Nizar Habash and his team at NYU Abu Dhabi are awesome for developing this and making it so accessible!)

from camel_tools.utils.charmap import CharMapper
sentence = "ذهبت إلى المكتبة."
print(sentence)

ar2bw = CharMapper.builtin_mapper('ar2bw')

sent_bw = ar2bw(sentence)
print(sent_bw)

Output:

ذهبت إلى المكتبة.
*hbt <lY Almktbp.

You can find installation instructions and tutorials here: https://github.com/CAMeL-Lab/camel_tools

Old answer

Incidentally, someone already wrote a script that does this, so you might want to check that out before spending too much time on your own: buckwalter2unicode.py

It probably does more than you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and I use that on my site.

The script above is what I had been using, but I just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, which seems to be simpler and faster (it references a dictionary buck2uni that maps Buckwalter symbols to Unicode Arabic characters):

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
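
The loop above scans the whole string once per dictionary entry. As a possible alternative (not part of the original answer), Python's built-in str.translate applies every single-character mapping in one pass. The sketch below assumes a buck2uni dictionary shaped like the one referenced above; only a hypothetical subset of the Buckwalter table is shown, and the names to_buckwalter and to_arabic are made up for this illustration.

```python
# Hypothetical subset of the Buckwalter table (symbol -> Arabic character).
# The full scheme is listed at http://www.qamus.org/transliteration.htm
buck2uni = {
    "'": "\u0621",  # hamza
    "|": "\u0622",  # alef with madda above
    ">": "\u0623",  # alef with hamza above
    "<": "\u0625",  # alef with hamza below
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "p": "\u0629",  # teh marbuta
    "t": "\u062A",  # teh
    "*": "\u0630",  # thal
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
    "h": "\u0647",  # heh
    "Y": "\u0649",  # alef maksura
}

# Build both translation tables once, outside any loop.
BUCK_TO_UNI = str.maketrans(buck2uni)
UNI_TO_BUCK = str.maketrans({v: k for k, v in buck2uni.items()})


def to_buckwalter(text):
    """Arabic -> Buckwalter in a single pass; unmapped characters are kept."""
    return text.translate(UNI_TO_BUCK)


def to_arabic(text):
    """Buckwalter -> Arabic in a single pass; unmapped characters are kept."""
    return text.translate(BUCK_TO_UNI)


# Same sentence as in the Camel Tools example, written with escapes.
arabic = "\u0630\u0647\u0628\u062A \u0625\u0644\u0649 \u0627\u0644\u0645\u0643\u062A\u0628\u0629."
print(to_buckwalter(arabic))
```

Because the tables are built once and reused, the cost per string depends on its length rather than on the length multiplied by the number of dictionary entries. This only works when every key is a single character, which holds for the Buckwalter scheme.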
