Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

Views: 32 Published: 2021/11/25 13:21:11 Tags: java unicode diacritics transliteration

Question

I am looking for an algorithm that can map characters with diacritics (tilde, circumflex, caret, umlaut, caron) to their "simple" base character.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

etc.

I want to do this in Java, although I suspect the solution is something Unicode-based and should be reasonably easy to do in any language.

Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also store Bjorn_Borg, so that I can find the entry when someone types Bjorn instead of Björn.

Recommended answer

I did this recently in Java:

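import java.text.Normalizer;
import java.util.regex.Pattern;

// Combining diacritical marks plus modifier letters (Lm) and modifier symbols (Sk)
public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    // NFD splits each accented character into a base letter followed by combining marks
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Remove the marks and keep the base letters
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}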

This does what you specified:

stripDiacritics("Björn")  = Bjorn
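To make the two steps visible, here is a minimal, self-contained sketch. The class name `StripDiacriticsDemo` and the helper `decomposedLength` are made up for illustration; the pattern and the `stripDiacritics` body are the ones from the answer. Non-ASCII characters are written as `\u` escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class StripDiacriticsDemo {
    // Same character classes as in the answer above
    static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    // Hypothetical helper: number of chars after canonical decomposition (NFD)
    static int decomposedLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).length();
    }

    static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00F6 (o with diaeresis) decomposes into 'o' + U+0308
        System.out.println(decomposedLength("\u00F6"));      // 2
        System.out.println(stripDiacritics("Bj\u00F6rn"));   // Bjorn
        // U+1EA5 carries two marks (circumflex and acute); both are removed
        System.out.println(stripDiacritics("\u1EA5"));       // a
    }
}
```

Characters that carry several marks, such as ấ, decompose into one base letter and several combining marks, so a single pass handles them as well.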

but it fails on, for example, Białystok, because the ł character is not a base letter plus a diacritic: it has no canonical decomposition, so NFD leaves it unchanged.
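A quick way to see which characters fall into this category is to check whether NFD changes them at all. The sketch below is only an illustration; the class `DecompositionCheck` and its method are hypothetical names, not part of the answer.

```java
import java.text.Normalizer;

class DecompositionCheck {
    // True if canonical decomposition (NFD) leaves the string unchanged,
    // meaning there is no combining mark to strip afterwards
    static boolean isUnchangedByNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(isUnchangedByNfd("\u0142")); // true:  l with stroke
        System.out.println(isUnchangedByNfd("\u00F8")); // true:  o with stroke
        System.out.println(isUnchangedByNfd("\u00DF")); // true:  sharp s
        System.out.println(isUnchangedByNfd("\u00F6")); // false: o with diaeresis
    }
}
```

Any character for which this check returns true needs an explicit replacement rule if it should be simplified.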

If you want a full-blown string simplifier, you need a second cleanup pass for special characters that are not diacritics. In the map below I have included the most common special characters that appear in our customer names. It is not a complete list, but it shows how to extend it. ImmutableMap is a simple class from google-collections (now part of Guava).

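import com.google.common.collect.ImmutableMap;
import java.text.Normalizer;
import java.util.regex.Pattern;

public class StringSimplifier {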
    public static final char DEFAULT_REPLACE_CHAR = '-';
    public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
    private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()

        // Remove characters that carry no semantics
        .put(".", "")
        .put("\"", "")
        .put("'", "")

        // Keep relevant characters as separators
        .put(" ", DEFAULT_REPLACE)
        .put("]", DEFAULT_REPLACE)
        .put("[", DEFAULT_REPLACE)
        .put(")", DEFAULT_REPLACE)
        .put("(", DEFAULT_REPLACE)
        .put("=", DEFAULT_REPLACE)
        .put("!", DEFAULT_REPLACE)
        .put("/", DEFAULT_REPLACE)
        .put("\\", DEFAULT_REPLACE)
        .put("&", DEFAULT_REPLACE)
        .put(",", DEFAULT_REPLACE)
        .put("?", DEFAULT_REPLACE)
        .put("°", DEFAULT_REPLACE) //Remove ?? is diacritic?
        .put("|", DEFAULT_REPLACE)
        .put("<", DEFAULT_REPLACE)
        .put(">", DEFAULT_REPLACE)
        .put(";", DEFAULT_REPLACE)
        .put(":", DEFAULT_REPLACE)
        .put("_", DEFAULT_REPLACE)
        .put("#", DEFAULT_REPLACE)
        .put("~", DEFAULT_REPLACE)
        .put("+", DEFAULT_REPLACE)
        .put("*", DEFAULT_REPLACE)

        // Replace non-diacritics with their equivalent characters
        .put("\u0141", "l") // Ł
        .put("\u0142", "l") // ł, as in Białystok
        .put("ß", "ss")
        .put("æ", "ae")
        .put("ø", "o")
        .put("©", "c")
        .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
        .put("\u00F0", "d")
        .put("\u0110", "d")
        .put("\u0111", "d")
        .put("\u0189", "d")
        .put("\u0256", "d")
        .put("\u00DE", "th") // thorn Þ
        .put("\u00FE", "th") // thorn þ
        .build();


    public static String simplifiedString(String orig) {
        String str = orig;
        if (str == null) {
            return null;
        }
        str = stripDiacritics(str);
        str = stripNonDiacritics(str);
        if (str.length() == 0) {
            // Ugly special case to work around non-existing empty strings
            // in Oracle. Store original crapstring as simplified.
            // It would return an empty string if Oracle could store it.
            return orig;
        }
        return str.toLowerCase();
    }

    private static String stripNonDiacritics(String orig) {
        StringBuilder ret = new StringBuilder();
        String lastchar = null;
        for (int i = 0; i < orig.length(); i++) {
            String source = orig.substring(i, i + 1);
            String replace = NONDIACRITICS.get(source);
            String toReplace = replace == null ? String.valueOf(source) : replace;
            if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                toReplace = "";
            } else {
                lastchar = toReplace;
            }
            ret.append(toReplace);
        }
        if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
            ret.deleteCharAt(ret.length() - 1);
        }
        return ret.toString();
    }

    /*
    Special regular expression character ranges relevant for simplification -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
    InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î etc..
        IsSk: Symbol, Modifier see http://www.fileformat.info/info/unicode/category/Sk/list.htm
        IsLm: Letter, Modifier see http://www.fileformat.info/info/unicode/category/Lm/list.htm
     */
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");


    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}
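To connect this back to the search use case in the question, the sketch below shows one way a simplified key could be used for lookups. Everything here is an assumption made for illustration: `PlayerIndex`, `key`, `add`, and `find` are hypothetical names, the key function is a reduced JDK-only stand-in (NFD, strip all combining marks via `\p{M}`, lower-case) rather than the full `StringSimplifier` above, and an in-memory map stands in for the database.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PlayerIndex {
    // Maps simplified key -> name as originally entered
    private final Map<String, String> byKey = new HashMap<>();

    // Simplified search key: decompose, drop combining marks, lower-case
    static String key(String name) {
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    void add(String name) {
        byKey.put(key(name), name);
    }

    // Returns the stored name, or null if nothing matches
    String find(String query) {
        return byKey.get(key(query));
    }

    public static void main(String[] args) {
        PlayerIndex index = new PlayerIndex();
        index.add("Bj\u00F6rn_Borg");
        // Both spellings resolve to the same stored entry
        System.out.println(index.find("Bjorn_Borg") != null);      // true
        System.out.println(index.find("Bj\u00F6rn_Borg") != null); // true
    }
}
```

Applying the same key function to both the stored value and the query is what makes the two spellings match; the stored original is kept for display.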
