Sorting Arabic words in Java

Question

I have a list of Arabic words that I'd like to sort. I have tried the standard Collator with different Locales (such as English or French, though without much hope), and I have even created my own RuleBasedCollator, but to no avail. Apparently the default sorting relies on Unicode code point order, which works in many cases but not in this one.

Following the javadocs, RuleBasedCollator requires a string listing the characters in the order you want them sorted. I created the following string, taking the Unicode code points from this table:

String arabicLetters = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";

Arabic letters can take up to four forms depending on their position in a word. So in the rules string above I made all the forms of each letter equal with '=', and then indicated the order of the letters by separating them with '<'. I imagine this is the correct way to do it.
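For reference, the rule syntax described above can be tried on a toy alphabet. This is only a sketch: the class name RuleSyntaxDemo, the helper fromRules, and the three-letter Latin rule string are made up for illustration and are not part of the question's code. It shows '<' imposing an order and '=' marking two characters as identical, and it fails loudly on a bad rule string instead of leaving the collator null.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class RuleSyntaxDemo {
    // Hypothetical helper: builds a collator from a rule string and
    // rethrows a parse failure instead of silently continuing with null.
    static RuleBasedCollator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // Toy rules (an assumption for illustration): c sorts before b,
        // b before a, and 'A' is declared identical to 'a'.
        RuleBasedCollator col = fromRules("< c < b < a = A");

        System.out.println(col.compare("c", "b") < 0); // is c ordered before b?
        System.out.println(col.compare("a", "A"));     // result for the '=' pair
    }
}
```

Building the collator once and reusing it, rather than parsing the rules inside every compare() call, also avoids most of the cost mentioned in the question.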

Now, if I have a collection with the days of the week (listed here in weekday order, not alphabetically):

الأَحَد, الاِثنَين, الثُّلاثاء, الأَربِعاء, الخَميس, الجُمعة,السَّبت

The results I get are not sorted at all:

الأَحَد, الخَميس, الاِثنَين, الثُّلاثاء, الأَربِعاء, السَّبت, الجُمعة

Besides, even for such a small number of words it takes a considerable amount of time, which makes it unusable.

Does anybody know whether I'm doing something wrong, or whether there is a life-saving library that already handles this?

I did some googling before writing this, and I'm surprised that I didn't find a single result.

Thanks!

Update with code:

public static class TranslatableComparator implements java.util.Comparator<Translatable> {
        @Override
        public int compare(Translatable t1, Translatable t2) {

            String sortingRules = "< \u0623=\uFE83=\uFE84 < \u0628=\uFE8F=\uFE90=\uFE92=\uFE91 < \u062A=\uFE95=\uFE96=\uFE98=\uFE97 < \u062B=\uFE99=\uFE9A=\uFE9C=\uFE9B < \u062C=\uFE9D=\uFE9E=\uFEA0=\uFE9F < \u062D=\uFEA1=\uFEA2=\uFEA4=\uFEA3 < \u062E=\uFEA5=\uFEA6=\uFEA8=\uFEA7 < \u062F=\uFEA9=\uFEAA < \u0630=\uFEAB=\uFEAC < \u0631=\uFEAD=\uFEAE < \u0632=\uFEAF=\uFEB0 < \u0633=\uFEB1=\uFEB2=\uFEB4=\uFEB3 < \u0634=\uFEB5=\uFEB6=\uFEB8=\uFEB7 < \u0635=\uFEB9=\uFEBA=\uFEBC=\uFEBB < \u0636=\uFEBD=\uFEBE=\uFEC0=\uFEBF < \u0637=\uFEC1=\uFEC2=\uFEC4=\uFEC3 < \u0638=\uFEC5=\uFEC6=\uFEC8=\uFEC7 < \u0639=\uFEC9=\uFECA=\uFECC=\uFECB < \u063A=\uFECD=\uFECE=\uFED0=\uFECF < \u0641=\uFED1=\uFED2=\uFED4=\uFED3 < \u0642=\uFED5=\uFED6=\uFED8=\uFED7 < \u0643=\uFED9=\uFEDA=\uFEDC=\uFEDB < \u0644=\uFEDD=\uFEDE=\uFED0=\uFEDF < \u0645=\uFEE1=\uFEE2=\uFEE4=\uFEE3 < \u0646=\uFEE5=\uFEE6=\uFEE8=\uFEE7 < \u0647=\uFEE9=\uFEEA=\uFEEC=\uFEEB < \u0648=\uFEED=\uFEEE < \u064A=\uFEF1=\uFEF2=\uFEF4=\uFEF3 < \u0622=\uFE81=\uFE82 < \u0629=\uFE93=\uFE94 < \u0649=\uFEEF=\uFEF0 < \u0627";
            RuleBasedCollator col = null;
            try {
                col = new RuleBasedCollator(sortingRules);
            } catch (ParseException e) {
                //col = (RuleBasedCollator)RuleBasedCollator.getInstance(Locale.FRENCH);
            }

            return col.getCollationKey(t1.getTranslation().getText()).compareTo(col.getCollationKey(t2.getTranslation().getText()));
        }
    }

Recommended answer

You don't need to define your own collator; just use the built-in one for Arabic. Your Comparator then looks like this:

public int compare(Translatable t1, Translatable t2) {
        return Collator.getInstance(new Locale("ar")).compare(t1.getTranslation().getText(), t2.getTranslation().getText());
}
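Since Translatable is not shown in the question, here is a self-contained sketch of the same idea on plain strings. The class name ArabicSortDemo and the method sortArabic are made up for illustration, and Locale.forLanguageTag("ar") is used in place of new Locale("ar") on the assumption that either yields the same Arabic collator. The collator is created once and reused rather than being built on every comparison.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ArabicSortDemo {
    // Created once and shared; Collator implements Comparator<Object>,
    // so it can be passed directly to List.sort.
    private static final Collator ARABIC =
            Collator.getInstance(Locale.forLanguageTag("ar"));

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortArabic(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(ARABIC);
        return sorted;
    }

    public static void main(String[] args) {
        // teh, alef, beh written as Unicode escapes to sidestep
        // source-file encoding problems.
        List<String> letters = Arrays.asList("\u062A", "\u0627", "\u0628");
        System.out.println(sortArabic(letters));
    }
}
```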

(You can check whether a collator is available for Arabic by browsing the results of Collator.getAvailableLocales().)
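A minimal sketch of that check, assuming it is enough to match on the language code. The class ArabicCollatorCheck and the method hasCollatorFor are illustrative names, not part of any library.

```java
import java.text.Collator;
import java.util.Arrays;

class ArabicCollatorCheck {
    // True if any locale reported by Collator.getAvailableLocales()
    // has the given language code (for example "ar").
    static boolean hasCollatorFor(String language) {
        return Arrays.stream(Collator.getAvailableLocales())
                .anyMatch(locale -> locale.getLanguage().equals(language));
    }

    public static void main(String[] args) {
        System.out.println("Arabic collator available: " + hasCollatorFor("ar"));
    }
}
```

The result depends on the locale data shipped with the runtime, so it is worth running on the target JVM rather than assuming an answer.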

As noted in the comments, if you're worried about performance you should compute the collation keys, store them in your Translatable objects, and sort on those instead.
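A sketch of the collation-key approach on plain strings, since the Translatable class is not shown. CollationKeySortDemo and sortWithKeys are hypothetical names. In real code the key would be computed once per object and kept in a field, as suggested above. Each key is computed once, and sorting then compares the precomputed keys instead of re-analysing the strings on every comparison.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationKeySortDemo {
    // Computes one CollationKey per word, sorts the keys, and maps back to strings.
    static List<String> sortWithKeys(List<String> words, Collator collator) {
        List<CollationKey> keys = new ArrayList<>(words.size());
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        Collections.sort(keys); // CollationKey implements Comparable<CollationKey>

        List<String> sorted = new ArrayList<>(keys.size());
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator arabic = Collator.getInstance(Locale.forLanguageTag("ar"));
        List<String> letters = Arrays.asList("\u062A", "\u0627", "\u0628");
        System.out.println(sortWithKeys(letters, arabic));
    }
}
```

Keys are only comparable with other keys produced by the same collator, so the collator should be shared across all objects being sorted.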

If you really want to see where your rules differ from the standard collator's, just print out its rules:

System.out.println(((RuleBasedCollator) Collator.getInstance(new Locale("ar"))).getRules());
