Converting Symbols, Accent Letters to English Alphabet

Question

The problem is that, as you know, there are thousands of characters in the Unicode charts, and I want to convert all the look-alike characters to the corresponding letters of the English alphabet.

For example, here are a few conversions:

ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...

I also saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They look like needles in a haystack.

The complete list of Unicode characters is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and look at the variations of each letter.

How can I convert all of these in Java? Please help me :(

Recommended answer

Related question: How do I remove diacritics (accents) from a string in .NET?

This method works fine in Java (purely for the purpose of removing diacritical marks, also known as accents).

It basically converts every accented character into its unaccented base character followed by its combining diacritical marks (this is Unicode NFD normalization). You can then use a regex to strip off the diacritics.
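To make the decomposition step concrete, here is a minimal sketch that prints the code points of a string after NFD normalization. The class name NfdDemo and the helper nfdCodePoints are made up for illustration and are not part of the original answer.

```java
import java.text.Normalizer;

public class NfdDemo {
    /** Returns the code points of the NFD form of s, formatted as "U+XXXX U+XXXX ...". */
    public static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) becomes the base letter e plus a combining acute accent.
        System.out.println(nfdCodePoints("\u00E9")); // prints "U+0065 U+0301"
        // U+0232 (Y with macron) becomes the base letter Y plus a combining macron.
        System.out.println(nfdCodePoints("\u0232")); // prints "U+0059 U+0304"
    }
}
```

The trailing code points (U+0301, U+0304) fall in the Combining Diacritical Marks block, which is what the regex step removes.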

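import java.text.Normalizer;
import java.util.regex.Pattern;

public class DeAccent {
    // Matches runs of combining diacritical marks (block U+0300 to U+036F).
    // The backslash must be doubled inside a Java string literal.
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String deAccent(String str) {
        // NFD splits each accented character into a base letter plus combining marks.
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
    }
}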
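Note that this approach only helps for characters that Unicode defines as a base letter plus combining marks. Of the examples in the question, Ȳ and Ǭ decompose to Y and O, but Ƈ and ҥ have no canonical decomposition, and Ѷ only reduces to the Cyrillic letter Ѵ. The sketch below (Java 9+ for Map.of) combines the same NFD-and-strip step with a small lookup table for the leftovers. The class name LookalikeFolder and the contents of the FALLBACK table are hypothetical and hand-picked from the question's examples; they are not a standard mapping and not part of the original answer.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

public class LookalikeFolder {
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical, hand-maintained table for characters that NFD cannot
    // reduce to an English letter. Extend it with whatever mappings you need.
    private static final Map<Integer, String> FALLBACK = Map.of(
            0x04A5, "H",  // Cyrillic small ligature en ghe, mapped as in the question
            0x0474, "V",  // Cyrillic capital izhitsa (what is left of U+0476 after stripping)
            0x0187, "C"   // Latin capital C with hook
    );

    public static String fold(String input) {
        // Step 1: decompose and strip combining marks, as in the answer above.
        String stripped = DIACRITICS
                .matcher(Normalizer.normalize(input, Normalizer.Form.NFD))
                .replaceAll("");
        // Step 2: replace any remaining look-alikes found in the fallback table.
        StringBuilder out = new StringBuilder(stripped.length());
        stripped.codePoints().forEach(cp -> {
            String replacement = FALLBACK.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("caf\u00E9")); // prints "cafe"
        System.out.println(fold("\u0232"));    // Y with macron, prints "Y"
        System.out.println(fold("\u0187"));    // C with hook, prints "C"
    }
}
```

Characters that appear in neither the decomposition data nor the table pass through unchanged, so the table has to grow with the set of look-alikes you care about.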
