How to retrieve only arabic texts from a string using regular expression?
Problem description
I have a string which contains both Arabic and English sentences. What I want is to extract only the Arabic sentences.
my_string="""
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""
This link shows that the Unicode range for Arabic letters is 0600-06FF.
So, the very basic attempt that came to my mind is:
import re
print re.findall(r'[\u0600-\u06FF]+',my_string)
But, this fails miserably as it returns the following list.
['What', 'is', 'the', 'reason', 'behind', 'this?']
As you can see, this is exactly the opposite of what I want. What am I missing here?
N.B.
I know I can match the Arabic letters by using inverse matching like below:
print re.findall(r'[^a-zA-Z\s0-9]+',my_string)
But, I don't want that.
You can use re.sub to replace ASCII characters with an empty string.
>>> import re
>>> my_string="""
... What is the reason
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... behind this?
... ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
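The character class above only removes ASCII letters and the question mark, so digits or other punctuation would survive. A minimal Python 3 sketch of a broader variant follows; the sample string `mixed` and the names `without_ascii` and `arabic_lines` are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample input: English text with digits and punctuation around an Arabic line
mixed = "What is the reason\nذَلِكَ الْكِتَابُ\nbehind this? (v2.0)"

# [!-~] covers every printable ASCII character except the space,
# so spaces between Arabic words are preserved
without_ascii = re.sub(r'[!-~]', '', mixed)

# Drop the lines that are now empty or whitespace-only
arabic_lines = [line.strip() for line in without_ascii.splitlines() if line.strip()]

print(arabic_lines)
```

This still keeps any non-ASCII character that is not Arabic, so it is only a reasonable fit when the input is known to mix just English and Arabic.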
Your regex didn't work because you are using Python 2 and your string is a str (a byte string), so \u0600 in the pattern is not treated as a Unicode escape. You need to convert my_string to unicode for it to work. The same pattern works as written on Python 3.x.
>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
ذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَذَلِكَالْكِتَابُلَارَيْبَفِيهِهُدًىلِلْمُتَّقِينَ
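For comparison, here is a minimal Python 3 sketch, where str is already Unicode and no conversion is needed. The extended pattern that also allows single spaces between Arabic words, and the names `ARABIC_RUN` and `arabic_sentences`, are illustrative assumptions and not part of the original answer; they keep each sentence intact instead of joining all characters together.

```python
import re

my_string = """
What is the reason
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
behind this?
ذَلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ هُدًى لِلْمُتَّقِينَ
"""

# One or more Arabic-block characters, optionally followed by
# further Arabic words separated by a single space
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+(?: [\u0600-\u06FF]+)*')

# Each match is one whole run of Arabic words (one sentence per line here)
arabic_sentences = ARABIC_RUN.findall(my_string)

for sentence in arabic_sentences:
    print(sentence)
```

The range 0600-06FF also contains the Arabic diacritics used in this text, which is why the vocalised words match as whole units.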