How to retrieve only Arabic text from a string using a regular expression?

Problem description

I have a string that contains both Arabic and English sentences. I want to extract only the Arabic sentences.

my_string="""
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

This link shows that the Unicode range for Arabic letters is 0600-06FF.
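To make the range concrete, here is a minimal sketch of a per-character check. It assumes Python 3, where `str` is Unicode; `is_arabic_char` is a hypothetical helper name, not something from the question. Note that the 0600-06FF block also contains the vowel marks (for example FATHA, U+064E), so they pass the check too.

```python
def is_arabic_char(ch):
    """Return True if ch lies in the Arabic block U+0600..U+06FF."""
    return '\u0600' <= ch <= '\u06FF'

# "What " followed by an Arabic word with vowel marks, spelled with escapes:
# THAL, FATHA, LAM, KASRA, KAF, FATHA.
sample = "What \u0630\u064E\u0644\u0650\u0643\u064E"

# Keep only the characters that fall inside the block.
arabic_only = "".join(ch for ch in sample if is_arabic_char(ch))
print([hex(ord(ch)) for ch in arabic_only])
```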

So the very basic attempt that came to my mind is:

import re
print re.findall(r'[\u0600-\u06FF]+',my_string)

But this fails miserably, as it returns the following list.

['What', 'is', 'the', 'reason', 'behind', 'this?']

As you can see, this is the exact opposite of what I want. What am I missing here?

N.B.

I know I can match the Arabic letters by using inverse matching, like below:

print re.findall(r'[^a-zA-Z\s0-9]+',my_string)

But I don't want that.

Solution

You can use re.sub to replace the ASCII letters with an empty string.

>>> import re
>>> my_string="""
... What is the reason
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... behind this?
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ

ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
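The pattern `[a-zA-Z?]` removes only letters and the question mark, so digits or other punctuation in the English text would survive. A slightly broader sketch of the same substitution idea follows, assuming Python 3; the names `arabic_line`, `stripped` and `arabic_lines` are illustrative only, and the sample string is rebuilt from one copy of the Arabic line.

```python
import re

arabic_line = "ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ"
my_string = f"\nWhat is the reason\n{arabic_line}\nbehind this?\n{arabic_line}\n"

# Remove every printable ASCII character except the space (0x21-0x7E).
stripped = re.sub(r'[\x21-\x7E]', '', my_string)

# Drop the lines that are now empty or whitespace-only.
arabic_lines = [line.strip() for line in stripped.splitlines() if line.strip()]
print("\n".join(arabic_lines))
```

Filtering the lines afterwards also gets rid of the leftover whitespace-only line that `strip()` alone leaves between the two Arabic sentences.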

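Your regex didn't work because you are using Python 2 and my_string is a byte string (str). In a byte-string pattern the \u0600 and \u06FF escapes are not interpreted as Unicode code points, so you need to make the pattern a Unicode literal and convert my_string to unicode for it to work. The same pattern works as written on Python 3.x, where strings are Unicode by default.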
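The list of English words in the question can be reproduced. As far as I can tell, Python 2's `re` reads `\u` inside a byte-string character class as a plain `u`, so the class collapses to a few literal characters plus the ASCII range `0` to `u`. The sketch below imitates that reading in Python 3 by simply dropping the backslashes; this is an assumption about the mechanism, not a statement from the `re` documentation.

```python
import re

arabic_line = "ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ"
my_string = f"\nWhat is the reason\n{arabic_line}\nbehind this?\n{arabic_line}\n"

# Misread class: literal u, 0, 6, F plus the range '0'..'u', which covers
# digits, upper-case letters, '?' and the lower-case letters a to u.
misread = re.findall(r'[u0600-u06FF]+', my_string)
print(misread)

# Intended class: real \uXXXX escapes in a Python 3 str pattern.
intended = re.findall(r'[\u0600-\u06FF]+', my_string)
print(len(intended))
```

The first call returns the English words, the second only Arabic runs. On Python 2 the fix is to make both the pattern and the string Unicode.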

>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
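Joining single characters, as above, loses the spaces between words. A minimal Python 3 sketch that keeps each Arabic phrase intact is shown below; `ARABIC_PHRASE` and `phrases` are hypothetical names, and the pattern assumes the words inside a phrase are separated by plain spaces.

```python
import re

arabic_line = "ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ"
my_string = f"\nWhat is the reason\n{arabic_line}\nbehind this?\n{arabic_line}\n"

# A run of Arabic-block characters, optionally followed by further runs
# separated by spaces, so a whole phrase is returned as one match.
ARABIC_PHRASE = re.compile(r'[\u0600-\u06FF]+(?: +[\u0600-\u06FF]+)*')

phrases = ARABIC_PHRASE.findall(my_string)
for phrase in phrases:
    print(phrase)
```

If the input arrives as bytes (for example, read from a file in binary mode), it would need `data.decode("utf-8")` first, which plays the same role as `unicode(my_string, "utf-8")` in the Python 2 answer.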
