How can I remove Arabic punctuation from a String in Java?
Problem description
I am working on an Arabic dictionary, and I am getting sentences like
String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
from my database, but I can't process the sentence without removing the accents and punctuation.
I tried using:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
but it didn't work.
Recommended answer
Why don't you just go for the Unicode punctuation (\p{P}) and non-spacing mark (\p{Mn}) categories?
I'm not sure what your expected result is, since you didn't post it (and I can't read Arabic :)), but try this code:
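import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
// \p{P} matches punctuation, \p{Mn} matches non-spacing marks (the Arabic diacritics)
Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
Matcher m = p.matcher(input);
// List every character the pattern matches
while (m.find()) {
    System.out.println("found: " + m.group());
}
// Rewind the matcher, then replace each match with a space
m.reset();
System.out.println("Replaced: " + m.replaceAll(" "));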
Output:
found: '
found: َ
found: َ
found: َ
found: ُ
found: ً
found: :
found: َ
found: َ
found: َ
found: َ
found: َ
found: ّ
found: َ
found: َ
found: .
found: '
Replaced: أ ب ن ف لان ا ع اب ه ور م اه بخ ل ة س وء
I suppose it's not your desired final result, but I hope it's something you can work with.
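If the goal is to drop the diacritics and punctuation entirely, not to put a space where each one was, one possible variation is to replace with an empty string and then tidy the whitespace. This is only a sketch: the class name `ArabicCleaner` and the method `strip` are made up for illustration, and collapsing runs of whitespace is my assumption about what the dictionary needs.

```java
import java.util.regex.Pattern;

public class ArabicCleaner {
    // \p{P} = any punctuation, \p{Mn} = non-spacing marks (covers the Arabic diacritics).
    private static final Pattern PUNCT_AND_MARKS = Pattern.compile("[\\p{P}\\p{Mn}]+");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Hypothetical helper: deletes punctuation and marks, then collapses whitespace.
    public static String strip(String text) {
        String withoutMarks = PUNCT_AND_MARKS.matcher(text).replaceAll("");
        return SPACES.matcher(withoutMarks).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
        System.out.println(strip(original));
    }
}
```

Note that `\p{P}` covers punctuation only; symbols such as `+` or `$` belong to the `\p{S}` categories and would be left in place.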
Also, this is a gold mine of information on the Unicode categories. I believe most are applicable in a Java Pattern.
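To see which category or block a given character falls into, a small probe like the following may help. It is a rough sketch with hypothetical names (`CategoryProbe` and its two methods); it only uses `Character.getType` and `Character.UnicodeBlock.of` from the JDK. It also suggests why the attempt in the question had no effect: `\p{InCombiningDiacriticalMarks}` is a block test covering U+0300 to U+036F, while the Arabic fatha (U+064E) sits in the Arabic block, even though its general category is Mn.

```java
public class CategoryProbe {
    // Hypothetical helper: true when the code point has general category Mn.
    public static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    // Hypothetical helper: true when the code point lies in the
    // "Combining Diacritical Marks" block (U+0300 to U+036F).
    public static boolean isInCombiningDiacriticalMarks(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS;
    }

    public static void main(String[] args) {
        int fatha = 0x064E;           // Arabic fatha
        int combiningAcute = 0x0301;  // combining acute accent
        System.out.println("fatha is Mn: " + isNonSpacingMark(fatha));
        System.out.println("fatha in block: " + isInCombiningDiacriticalMarks(fatha));
        System.out.println("acute in block: " + isInCombiningDiacriticalMarks(combiningAcute));
    }
}
```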