Android, problem with file name comparison in Japanese characters

Question

I'm trying to match a search string against file names during a recursive directory search on Android. The problem is that the characters are Japanese, and in some cases the match fails. For example, the search string I'm trying to match against the start of the file name is "呼ぶ". When I print the file names from file.getName(), this is accurately reflected, e.g. the file name printed to the console starts with "呼ぶ". But when I do a match on the search string, e.g. fileName.startsWith("呼ぶ"), it doesn't match.

It turns out that when I print the substring of the file name being searched, the second character is different – the word is "呼ふ" instead of "呼ぶ". If I extract the bytes and print the hex characters, the last byte is off by 1 – presumably the difference between "ぶ" and "ふ".

Here is the code used to show the difference:

    String name = soundFile.getName();
    String string1 = question.kanji;


    Log.d(TAG, "searching for : s1:" + question.kanji + " + " + question.hiragana + " + " + question.english);
    Log.d(TAG, "name is: " + name);

    Log.d(TAG, "question.kanaji.length(): " + question.kanji.length());
    Log.d(TAG, "question.hiragana.length(): " + question.hiragana.length());


    String compareStart = name.substring(0, string1.length() );

    Log.d(TAG, "string1.length(): " + string1.length());
    Log.d(TAG, "compareStart.length(): " + compareStart.length());      

    byte[] nameUTF8 = null;
    byte[] s1UTF8 = null;
    byte[] csUTF8 = null;

    nameUTF8 = name.getBytes();
    s1UTF8 = string1.getBytes();
    csUTF8 = compareStart.getBytes();


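    // Note: this logs s1UTF8.length under the nameUTF8 label, which is why the
    // output below reports 6 here; nameUTF8.length itself would be larger.
    Log.d(TAG, "nameUTF8.length: " + s1UTF8.length);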
    Log.d(TAG, "s1UTF8.length: " + s1UTF8.length);
    Log.d(TAG, "csUTF8.length: " + csUTF8.length);

    for (int i = 0; i < s1UTF8.length; i++) {
        Log.d(TAG, "s1UTF8[i]: " + Integer.toString(s1UTF8[i] & 0xff, 16).toUpperCase());
    }

    for (int i = 0; i < csUTF8.length; i++) {
        Log.d(TAG, "csUTF8[i]: " + Integer.toString(csUTF8[i] & 0xff, 16).toUpperCase());
    }

    for (int i = 0; i < nameUTF8.length; i++) {
        Log.d(TAG, "nameUTF8[i]: " + Integer.toString(nameUTF8[i] & 0xff, 16).toUpperCase());
    }

The partial output is as follows:

D/AnswerView(12078): searching for : s1:呼ぶ + よぶ + to call out,to invite
D/AnswerView(12078): name is: 呼ぶ                                                     よぶ                 to call out,to invite.mp3
D/AnswerView(12078): question.kanaji.length(): 2
D/AnswerView(12078): question.hiragana.length(): 2
D/AnswerView(12078): string1: 呼ぶ
D/AnswerView(12078): compareStart: 呼ふ
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): nameUTF8.length: 6
D/AnswerView(12078): s1UTF8.length: 6
D/AnswerView(12078): csUTF8.length: 6
D/AnswerView(12078): s1UTF8[i]: E5
D/AnswerView(12078): s1UTF8[i]: 91
D/AnswerView(12078): s1UTF8[i]: BC
D/AnswerView(12078): s1UTF8[i]: E3
D/AnswerView(12078): s1UTF8[i]: 81
D/AnswerView(12078): s1UTF8[i]: B6
D/AnswerView(12078): csUTF8[i]: E5
D/AnswerView(12078): csUTF8[i]: 91
D/AnswerView(12078): csUTF8[i]: BC
D/AnswerView(12078): csUTF8[i]: E3
D/AnswerView(12078): csUTF8[i]: 81
D/AnswerView(12078): csUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E5
D/AnswerView(12078): nameUTF8[i]: 91
D/AnswerView(12078): nameUTF8[i]: BC
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 81
D/AnswerView(12078): nameUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 82
D/AnswerView(12078): nameUTF8[i]: 99
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20

This shows that the sixth byte of the extracted substring, and of the file name itself, is "B5" rather than the "B6" found in the search string. However, the printed file name is displayed correctly. I'm stumped. Why is the file name displayed correctly on the console when the underlying characters are different? And why are there three additional non-blank bytes near the start of the file name (E3 82 99), which somehow aren't needed in the search string to represent the "ぶ" character?

Recommended answer

The problem looks to be one of normalization forms. I know that on a Mac, for example, the filesystem is always in NFD. But the string you posted is in NFC. Watch:

% cat /tmp/u
呼ぶ

% uwc /tmp/u
   Paras    Lines    Words   Graphs    Chars    Bytes File
       0        1        1        3        3        7 /tmp/u

% uniquote -v  /tmp/u
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER BU}

% nfd /tmp/u | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER HU}\N{COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK}

% nfc /tmp/u | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER BU}
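
The same decomposition can be reproduced from Java, which may help tie it back to the bytes in the question's log. The following is a minimal sketch, not code from the original post: the class name CodePointDump and its helper methods are hypothetical, and it uses only java.text.Normalizer and standard String methods. It prints the code points and UTF-8 bytes of "呼ぶ" in NFC and in NFD.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class CodePointDump {
    // Hypothetical helper: lists the code points of s as "U+XXXX U+XXXX ...".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
            if (i < s.length()) sb.append(' ');
        }
        return sb.toString();
    }

    // Hypothetical helper: lists the UTF-8 bytes of s as upper-case hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfc = "\u547C\u3076"; // 呼ぶ with precomposed HIRAGANA LETTER BU
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(describe(nfc)); // U+547C U+3076
        System.out.println(utf8Hex(nfc));  // E5 91 BC E3 81 B6
        System.out.println(describe(nfd)); // U+547C U+3075 U+3099
        System.out.println(utf8Hex(nfd));  // E5 91 BC E3 81 B5 E3 82 99
    }
}
```

The NFD bytes match the start of nameUTF8 in the question: B5 is the last byte of HIRAGANA LETTER HU, and the following E3 82 99 is the combining voiced sound mark (U+3099). The console renders the pair as a single "ぶ", which is why the printed file name looks correct. Taking a two-character substring keeps only the base letter, which is why compareStart prints as "呼ふ".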

So I think you are going to have to think about converting to NFD.
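
One way this might look on the Java side is to bring both strings to the same normalization form before comparing. This is a sketch under assumptions rather than a definitive fix: the class NormalizedPrefixMatch and the method startsWithNormalized are hypothetical names, and the example file name is made up. It normalizes both sides to NFC; normalizing both to NFD also makes the two strings comparable, but with prefix matching in NFD a search for "呼ふ" would also match a name starting with "呼ぶ", because the voiced mark becomes a separate trailing code point.

```java
import java.text.Normalizer;

public class NormalizedPrefixMatch {
    // Hypothetical helper: normalizes both strings to the same form (NFC here)
    // and then performs an ordinary prefix check.
    static boolean startsWithNormalized(String fileName, String prefix) {
        String n = Normalizer.normalize(fileName, Normalizer.Form.NFC);
        String p = Normalizer.normalize(prefix, Normalizer.Form.NFC);
        return n.startsWith(p);
    }

    public static void main(String[] args) {
        // Assumed file name in decomposed form, as a file system might return it:
        // 呼 + ふ + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
        String fileName = "\u547C\u3075\u3099 yobu.mp3";
        // Search string in composed form: 呼 + ぶ
        String search = "\u547C\u3076";

        System.out.println(fileName.startsWith(search));            // false
        System.out.println(startsWithNormalized(fileName, search)); // true
    }
}
```

In the question's code, this would mean passing soundFile.getName() and question.kanji through the same normalization step before calling startsWith.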

BTW, that U+547C CJK code point happens to be this from the Unihan database:

 呼 U+547C Lo Han    CJK UNIFIED IDEOGRAPH-547C
  Mandarin     hu1 xu1
  Cantonese    fu1
  JapaneseKun  yobu
  JapaneseOn   ko
  Korean       ho
  HanyuPinlu   hu1(378) hu5(107)
  Vietnamese   hô
