Is there an iconv with //TRANSLIT equivalent in Java?


Question

Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the equivalent PHP function):

iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt  > new_doc.txt

Preferably it would operate on strings and not involve files at all.

I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.

Recommended answer

I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.

I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.

import java.text.Normalizer;

/* Decompose the original "accented" string into base characters plus combining marks. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
/* Build a new String containing only ASCII characters (code units below 128). */
StringBuilder buf = new StringBuilder(decomposed.length());
for (int idx = 0; idx < decomposed.length(); ++idx) {
  char ch = decomposed.charAt(idx);
  if (ch < 128) {
    buf.append(ch);
  }
}
String filtered = buf.toString();

With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).
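
If silently dropping characters is a problem, one possible variation is to keep a visible placeholder for anything that has no ASCII form, while still discarding the combining marks that NFKD splits off. The sketch below is only an illustration: the class name AsciiFallback, the method toAscii, and the choice of '?' as the placeholder are my own assumptions, not anything iconv or the JDK defines.

```java
import java.text.Normalizer;

final class AsciiFallback {
    /**
     * Hypothetical helper: NFKD-decompose the input, keep ASCII as-is,
     * drop combining marks, and replace every other code point with '?'.
     */
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder buf = new StringBuilder(decomposed.length());
        // Iterate by code point so a supplementary character yields one '?', not two.
        decomposed.codePoints().forEach(cp -> {
            if (cp < 128) {
                buf.append((char) cp);
                return;
            }
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                buf.append('?'); // assumed placeholder for "no ASCII equivalent"
            }
        });
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00e9"));        // prints "cafe"
        System.out.println(toAscii("\u4e2d\u6587 ok"));  // prints "?? ok"
    }
}
```

With this variation a string of Chinese characters comes out as a run of question marks instead of an empty string, so at least the length of what was lost stays visible.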

Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
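
A lookup-table approach along those lines might look roughly like the following. Treat it as a minimal sketch: the class TableTransliterator, the method transliterate, and the handful of table entries are hypothetical examples, and a real table would need to be built for the inputs you actually expect. Explicit substitutions are applied first, then only combining marks are stripped after canonical (NFD) decomposition; anything else is left untouched so the caller can decide what to do with it.

```java
import java.text.Normalizer;
import java.util.Map;

final class TableTransliterator {
    // Illustrative entries only (an assumption, not a complete or standard table).
    private static final Map<Integer, String> SUBSTITUTIONS = Map.of(
            0x00DF, "ss",   // LATIN SMALL LETTER SHARP S
            0x00E6, "ae",   // LATIN SMALL LETTER AE
            0x00F8, "o",    // LATIN SMALL LETTER O WITH STROKE
            0x0142, "l",    // LATIN SMALL LETTER L WITH STROKE
            0x20AC, "EUR"   // EURO SIGN
    );

    /** Hypothetical helper: table substitutions first, then strip combining marks. */
    static String transliterate(String input) {
        StringBuilder mapped = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String substitute = SUBSTITUTIONS.get(cp);
            if (substitute != null) {
                mapped.append(substitute);
            } else {
                mapped.appendCodePoint(cp);
            }
        });
        // NFD splits accented letters into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // \p{M} matches any Unicode mark; removing it leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Stra\u00dfe"));               // prints "Strasse"
        System.out.println(transliterate("Sm\u00f8rrebr\u00f8d"));      // prints "Smorrebrod"
        System.out.println(transliterate("\u00e9t\u00e9"));             // prints "ete"
    }
}
```

Because unmapped characters pass through unchanged, the result is not guaranteed to be pure ASCII; a final filtering or placeholder step can be added if the target charset requires it.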
