Extract Arabic phrases from a given text in Java
Question
Can you help me find a regex that takes a list of phrases and checks whether any of these phrases occurs in a given text?
Example:
If I have the following phrases in a HashSet:
كيف الحال
إلى أين
أين يوجد
هل من أحد هنا
The given text is: كيف الحال أتمنى أن تكون بخير
After applying the regex I want to get: كيف الحال
My initial code:
HashSet<String> QWWords = new HashSet<String>();
QWWords.add("كيف الحال");
QWWords.add("إلى أين");
QWWords.add("أين يوجد");
QWWords.add("هل من أحد هنا");
String s1 = "كيف الحال أتمنى أن تكون بخير";
for (String qp : QWWords) {
    Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
    Matcher m = p.matcher(s1);
    String found = "";
    while (m.find()) {
        found = m.group();
        System.out.println(found);
    }
}
Recommended answer
[...] is a character class, and a character class matches exactly one character out of the set it specifies. For instance, a character class like [abc] can match only a OR b OR c. So if you want to find the whole word abc, don't surround it with [...].
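To make the difference concrete, here is a minimal sketch comparing a character class with a plain character sequence. The class name CharClassDemo, the helper findAll, and the sample input "xabcx" are made up for this illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xabcx";
        // Character class: every match is a single character from the set.
        System.out.println(findAll("[abc]", input)); // prints [a, b, c]
        // Plain sequence: the match is the whole word.
        System.out.println(findAll("abc", input));   // prints [abc]
    }
}
```

This is why the pattern in the question, which wraps the whole phrase in [...], prints single characters instead of the phrase.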
Another problem is that you are using \\s as the word separator, so in the following String
String data = "foo foo foo foo";
the regex \\sfoo\\s will not be able to match the first foo, because there is no whitespace before it.
So the first match it finds will be
String data = "foo foo foo foo";
//      this one--^^^^^
Now, since the regex consumed the space after the second foo, it can't reuse that space in the next match, so the third foo will also be skipped because there is no space left to match before it.
You will also not match the fourth foo, because this time there is no space after it.
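The walkthrough above can be checked with a small sketch that records where each match starts. The class name WhitespaceSeparatorDemo and the helper matchStarts are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceSeparatorDemo {
    // Hypothetical helper: returns the start index of every match of the regex.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String data = "foo foo foo foo";
        // Only the second foo is found: the match begins at the space at index 3
        // and consumes the space at index 7, leaving none in front of the third foo.
        System.out.println(matchStarts("\\sfoo\\s", data)); // prints [3]
    }
}
```

Out of four occurrences, only one is reported, which matches the explanation: the first and last foo lack a space on one side, and the third loses its leading space to the previous match.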
To solve this problem you can use \\b, a word boundary, which checks whether the position it represents lies between an alphanumeric and a non-alphanumeric character (or at the start/end of the string). Unlike \\s, it consumes no characters.
So instead of
Pattern p = Pattern.compile("[\\s" + qp + "\\s]");
use
Pattern p = Pattern.compile("\\b" + qp + "\\b");
or, even better, as Tim mentioned,
Pattern p = Pattern.compile("\\b" + qp + "\\b",Pattern.UNICODE_CHARACTER_CLASS);
to make sure that \\b treats Arabic characters as part of the predefined alphanumeric class.
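As a rough illustration of the boundary check on Arabic text, the sketch below tests the phrase from the question against two inputs. The class name ArabicBoundaryDemo, the helper containsPhrase, and the second input (the phrase with one extra letter glued on) are assumptions made for this example, and the source file is assumed to be compiled as UTF-8.

```java
import java.util.regex.Pattern;

class ArabicBoundaryDemo {
    // Hypothetical helper: true if the phrase occurs in the text with a word
    // boundary on both sides. The phrase is assumed to contain no regex
    // metacharacters, as in the snippet above.
    static boolean containsPhrase(String phrase, String text) {
        Pattern p = Pattern.compile("\\b" + phrase + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String phrase = "كيف الحال";
        String s1 = "كيف الحال أتمنى أن تكون بخير";
        // The phrase is followed by a space, so the trailing \b is satisfied.
        System.out.println(containsPhrase(phrase, s1)); // prints true

        // The same phrase with one more Arabic letter appended: the last word
        // is now longer, so there is no boundary right after the phrase.
        String glued = phrase + "ة";
        System.out.println(containsPhrase(phrase, glued)); // prints false
    }
}
```

With the flag set, Arabic letters count as word characters, so the boundary falls where an Arabic word starts or ends rather than depending on surrounding whitespace.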
Update:
I am not sure whether your phrases can contain regex metacharacters like {, [, +, * and so on, so just in case you can also add an escaping mechanism that turns such characters into literals.
So
"\\b" + qp + "\\b"
can become
"\\b" + Pattern.quote(qp) + "\\b"
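Putting the pieces together, one possible version of the loop from the question looks like the sketch below. It is only an illustration under assumptions: the class name PhraseFinder and the method findPhrases are made up, the set is a LinkedHashSet purely to keep the output order predictable, and the source file is assumed to be compiled as UTF-8.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class PhraseFinder {
    // Hypothetical helper: returns the phrases that occur in the text as
    // whole phrases, treating every phrase as literal text.
    static List<String> findPhrases(Set<String> phrases, String text) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS);
            if (p.matcher(text).find()) {
                found.add(phrase);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> qwWords = new LinkedHashSet<>();
        qwWords.add("كيف الحال");
        qwWords.add("إلى أين");
        qwWords.add("أين يوجد");
        qwWords.add("هل من أحد هنا");

        String s1 = "كيف الحال أتمنى أن تكون بخير";
        // Only the first phrase occurs in s1, so the list has one element.
        System.out.println(findPhrases(qwWords, s1));
    }
}
```

One caveat with this approach: \\b only helps when a phrase starts and ends with a word character. A phrase that ends in punctuation would need a different boundary condition.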