Extracting Arabic words (not semantic Arabic phrases) from a string

Question

String description="Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. البيانات الضخمة هي عبارة عن مجموعة من مجموعة البيانات الضخمة جداً والمعقدة لدرجة أنه يُصبح من الصعب معالجتها باستخدام أداة واحدة فقط من أدوات إدارة قواعد البيانات أو باستخدام تطبيقات معالجة البيانات التقليدية. "

I need a regex that extracts only the Arabic words.

I checked this ticket; however, it is a PHP ticket, and I need a Java regex.

import java.util.regex.*;
Pattern p = Pattern.compile("#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u");
print(p.matcher(description).group(1));

It raises an error.

Recommended answer

To find one or more Arabic characters, you can use \p{InArabic}+

The Pattern documentation does not mention this class directly, but it does tell us about

Classes for Unicode scripts, blocks, categories and binary properties
\p{IsLatin} A Latin script character (script)
\p{InGreek} A character in the Greek block (block)
\p{Lu} An uppercase letter (category)
\p{IsAlphabetic} An alphabetic character (binary property)
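
As a quick check of the four kinds of classes listed above, here is a minimal sketch. The class name UnicodeClassDemo, the helper matchesClass, and the sample characters are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions for illustration only.
final class UnicodeClassDemo {

    // True when the character class `regexClass` matches the whole of `input`.
    static boolean matchesClass(String regexClass, String input) {
        return Pattern.matches(regexClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("\\p{IsLatin}", "a"));       // script
        System.out.println(matchesClass("\\p{InGreek}", "α"));       // block
        System.out.println(matchesClass("\\p{Lu}", "A"));            // category
        System.out.println(matchesClass("\\p{IsAlphabetic}", "ب"));  // binary property
        System.out.println(matchesClass("\\p{Lu}", "a"));            // lowercase letter, so no match
    }
}
```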

and, encouraged by the \p{InGreek} example, we can read on about blocks and find that

Blocks are specified with the prefix In, as in InMongolian, or by using the keyword block (or its short form blk) as in block=Mongolian or blk=Mongolian.
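
Applied to Arabic, the three spellings described in that sentence should be interchangeable. The following is a small sketch of that claim; BlockSyntaxDemo and allInBlock are hypothetical names chosen here, and the sample word is taken from the question's string.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions for illustration only.
final class BlockSyntaxDemo {

    // True when every character of `input` is matched by `blockClass`.
    static boolean allInBlock(String blockClass, String input) {
        return Pattern.compile(blockClass + "+").matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "البيانات";
        System.out.println(allInBlock("\\p{InArabic}", word));      // prefix form
        System.out.println(allInBlock("\\p{block=Arabic}", word));  // keyword form
        System.out.println(allInBlock("\\p{blk=Arabic}", word));    // short keyword form
        System.out.println(allInBlock("\\p{InArabic}", "data"));    // Latin letters are outside the block
    }
}
```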

The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.

That last sentence is the most important one for us. Now we need to check whether Character.UnicodeBlock supports a block of Arabic characters. So we visit its documentation, where we can find the field

public static final Character.UnicodeBlock ARABIC

which suggests that the block of Arabic characters is supported.
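
That conclusion can also be checked in code rather than by reading the documentation. A minimal sketch, assuming a hypothetical helper class ArabicBlockCheck that is not part of the original answer:

```java
// Hypothetical helper class; names are assumptions for illustration only.
final class ArabicBlockCheck {

    // True when the code point belongs to the Arabic block (U+0600 to U+06FF).
    static boolean isInArabicBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.ARABIC;
    }

    public static void main(String[] args) {
        // forName resolves the block name that follows the "In" prefix in \p{InArabic}.
        System.out.println(Character.UnicodeBlock.forName("Arabic") == Character.UnicodeBlock.ARABIC);
        System.out.println(isInArabicBlock('ب'));  // Arabic letter beh, U+0628
        System.out.println(isInArabicBlock('a'));  // Basic Latin, so not in the Arabic block
    }
}
```

Note that forName throws an IllegalArgumentException for a name it does not recognise, so a misspelled block name fails loudly instead of silently matching nothing.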

So, to find single Arabic words, your code can look like:

String description="Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. البيانات الضخمة هي عبارة عن مجموعة من مجموعة البيانات الضخمة جداً والمعقدة لدرجة أنه يُصبح من الصعب معالجتها باستخدام أداة واحدة فقط من أدوات إدارة قواعد البيانات أو باستخدام تطبيقات معالجة البيانات التقليدية. ";
Pattern p = Pattern.compile("\\p{InArabic}+");
Matcher m = p.matcher(description);
while(m.find()){
    System.out.println(m.group());
}

Output:

البيانات
الضخمة
هي
.
.
.
البيانات
التقليدية
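
If the words are needed as data rather than printed, the same loop can be wrapped in a method. This is only a sketch of one way to package it; the class ArabicWordExtractor and the method extractWords are hypothetical names, and the sample string is shortened from the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; names are assumptions for illustration only.
final class ArabicWordExtractor {

    // One or more consecutive characters from the Arabic block.
    private static final Pattern ARABIC_WORD = Pattern.compile("\\p{InArabic}+");

    // Returns every run of Arabic-block characters, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = ARABIC_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "Big data is a broad term. البيانات الضخمة هي عبارة";
        for (String word : extractWords(sample)) {
            System.out.println(word);
        }
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call.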

If you want to find groups of Arabic words separated by one or more whitespace characters, you can use this pattern:

Pattern p = Pattern.compile("\\p{InArabic}+(?:\\s+\\p{InArabic}+)*");

Note that * means zero or more, and + means one or more.

So this regex means:

\\p{InArabic}+     # one or more Arabic characters (Arabic word)
(?:                # non-capturing group storing:
  \\s+             # one or more whitespace characters
  \\p{InArabic}+   # with another Arabic word after it
)*                 # zero or more times
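
To see the grouping behaviour described above, here is a minimal sketch that collects each whitespace-separated run of Arabic words as one string. The class ArabicPhraseExtractor, the method extractRuns, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; names are assumptions for illustration only.
final class ArabicPhraseExtractor {

    // An Arabic word, followed by zero or more (whitespace + Arabic word) pairs.
    private static final Pattern ARABIC_RUN =
            Pattern.compile("\\p{InArabic}+(?:\\s+\\p{InArabic}+)*");

    // Returns each maximal run of Arabic words separated only by whitespace.
    static List<String> extractRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = ARABIC_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The ASCII period and the English words each end a run.
        String sample = "Big data البيانات الضخمة هي. More text أو باستخدام end";
        for (String run : extractRuns(sample)) {
            System.out.println("[" + run + "]");
        }
    }
}
```

Because the trailing whitespace only matches when another Arabic word follows it, a run never ends with a space.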
