Sorting Arabic words in Java

Problem description

I have a list of Arabic words that I'd like to sort. I have tried the standard Collator with different Locales (such as English or French, though without much hope), and I have even created my own RuleBasedCollator, but to no avail. Apparently the default sorting relies on the order of the Unicode code points, which works in many cases but not in this one.

Following the javadocs, RuleBasedCollator requires a string that lists the characters in the order you want them sorted. I created the following string, taking the Unicode code points from this table:

String arabicLetters = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";

Arabic letters can take up to four forms depending on their position in a word. In the rules string above I therefore made all forms of each letter equal using '=', and then indicated the order of the letters by separating them with '<'. I imagine that this is the correct way to do it.
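As a minimal illustration of that rule syntax, here is a sketch with a toy Latin alphabet rather than the Arabic rules above. The class name RuleSyntaxSketch and the helper sortWithToyRules are made up for this example; the rule string uses '<' to order letters and '=' to declare two characters as having no difference.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of the original question.
class RuleSyntaxSketch {

    // Toy rules: 'c' (declared equal to 'C') sorts first, then 'b', then 'a'.
    static final String TOY_RULES = "< c = C < b < a";

    // Returns a sorted copy of the input, ordered by the toy rules.
    static List<String> sortWithToyRules(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(TOY_RULES);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator); // Collator implements Comparator<Object>
            return sorted;
        } catch (ParseException e) {
            // Fail loudly instead of continuing with a null collator.
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortWithToyRules(Arrays.asList("a", "b", "c")));
    }
}
```

With these toy rules the input a, b, c comes back as c, b, a, which shows that the rule string, not the code point order, drives the result.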

Now, if I have a collection with the days of the week (sorted in this case by day of the week, not 'alphabetically'):

الأَحَد, الاِثنَين, الثُّلاثاء, الأَربِعاء, الخَميس, الجُمعة,السَّبت

The results I am getting are not sorted at all:

الأَحَد, الخَميس, الاِثنَين, الثُّلاثاء, الأَربِعاء, السَّبت, الجُمعة

Besides, for such a small number of words the sort takes a considerable amount of time, which makes it unusable.

Does anybody know if I'm doing something wrong or if there is a life-saving library that already handles this?

I did some googling before writing this and I'm surprised I didn't find a single result.

Thanks!


UPDATE with code:

public static class TranslatableComparator implements java.util.Comparator<Translatable> {
        @Override
        public int compare(Translatable t1, Translatable t2) {

            String sortingRules = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";
            RuleBasedCollator col = null;
            try {
                col = new RuleBasedCollator(sortingRules);
            } catch (ParseException e) {
                //col = (RuleBasedCollator)RuleBasedCollator.getInstance(Locale.FRENCH);
            }

            return col.getCollationKey(t1.getTranslation().getText()).compareTo(col.getCollationKey(t2.getTranslation().getText()));
        }
    }

Solution

You don't need to define your own collator; just use the built-in one for Arabic. Your Comparator then looks like this:

public int compare(Translatable t1, Translatable t2) {
    // Delegate the comparison to the JDK's built-in Arabic collator.
    return Collator.getInstance(new Locale("ar")).compare(t1.getTranslation().getText(), t2.getTranslation().getText());
}

(You can check if a collator is available for Arabic by browsing the results from Collator.getAvailableLocales().)
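A minimal, self-contained sketch of both ideas follows: checking for an Arabic collator and sorting plain strings with it. The class name ArabicSortSketch and its helper methods are invented for this example, and it sorts plain String values because the Translatable class is not shown in the question. It uses Locale.forLanguageTag("ar") in place of new Locale("ar"), since that constructor is deprecated in recent JDKs.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names here are illustrative only.
class ArabicSortSketch {

    // Reports whether the running JDK lists a collator for the Arabic language.
    static boolean hasArabicCollator() {
        for (Locale locale : Collator.getAvailableLocales()) {
            if ("ar".equals(locale.getLanguage())) {
                return true;
            }
        }
        return false;
    }

    // Returns a sorted copy of the words using the collator for the "ar" locale.
    static List<String> sortArabic(List<String> words) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("ar"));
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println("Arabic collator listed: " + hasArabicCollator());
        // beh, alef, teh as single-letter strings
        System.out.println(sortArabic(Arrays.asList("\u0628", "\u0627", "\u062A")));
    }
}
```

Whether an Arabic collator is listed depends on the JDK version and its locale providers, so the availability check is worth running on the target runtime.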

As noted in the comments, if you're worried about performance you should calculate the collation keys, store them in your Translatable objects and sort on them instead.
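The collation-key approach can be sketched as follows. This is an illustration under assumptions, not the original poster's code: the class CollationKeySketch and the method sortWithKeys are hypothetical, and the keys are kept in a local list rather than stored in Translatable objects, because that class is not shown. The point is that each key is computed once per word and the sort then compares keys only.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names here are illustrative only.
class CollationKeySketch {

    // Returns a sorted copy of the words, computing one CollationKey per word.
    static List<String> sortWithKeys(List<String> words) {
        // Create the collator once, not once per comparison.
        Collator collator = Collator.getInstance(Locale.forLanguageTag("ar"));

        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }

        // CollationKey implements Comparable<CollationKey>.
        Collections.sort(keys);

        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // beh, alef, teh as single-letter strings
        System.out.println(sortWithKeys(Arrays.asList("\u0628", "\u0627", "\u062A")));
    }
}
```

Compared with the comparator in the question, which builds a new RuleBasedCollator and two collation keys on every call to compare, this does the expensive work once per word.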

If you really want to see where the differences are between what you defined and the standard collator, just print out the rules:

System.out.println(((RuleBasedCollator) Collator.getInstance(new Locale("ar"))).getRules());
