How to compare Unicode characters that "look alike"?

Problem description

I ran into a surprising issue.

I load a text file in my application, and I have some logic that compares values containing µ.

I realized that even when the texts look the same, the comparison returns false.

 Console.WriteLine("μ".Equals("µ")); // returns false
 Console.WriteLine("µ".Equals("µ")); // returns true

In the second line, the character µ is copy-pasted on both sides.

However, these might not be the only characters that are like this.

Is there any way in C# to compare the characters which look the same but are actually different?

Solution

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should then match. Of course, which normalization form you need depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider whether normalization is appropriate for your use case; see Jukka K. Korpela's comment.
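To illustrate the caveat that look-alikes are not always the same character, here is a minimal sketch. The class `LookAlikeCaveat` and the helper `EqualAfterNfkc` are names made up for this illustration, not part of any library. Latin A (U+0041) and Cyrillic А (U+0410) are usually rendered almost identically, yet neither has a decomposition that maps to the other, so they stay different after normalization.

```csharp
using System;
using System.Text;

static class LookAlikeCaveat
{
    // Hypothetical helper: compares two strings ordinally after
    // normalizing both to compatibility form KC.
    public static bool EqualAfterNfkc(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string latinA = "\u0041";    // LATIN CAPITAL LETTER A
        string cyrillicA = "\u0410"; // CYRILLIC CAPITAL LETTER A

        // Visually similar, but no normalization form maps one to the other
        Console.WriteLine(EqualAfterNfkc(latinA, cyrillicA));   // False

        // MICRO SIGN and GREEK SMALL LETTER MU do match under form KC
        Console.WriteLine(EqualAfterNfkc("\u00B5", "\u03BC")); // True
    }
}
```

If you need to treat such cross-script look-alikes as equal, normalization alone won't do it.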

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
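One way to check this yourself is to print the code points before and after normalization. The sketch below uses a hypothetical helper, `Describe`, which formats each UTF-16 code unit as `U+XXXX`. It assumes the default (non-invariant) globalization setup, where `string.Normalize` handles non-ASCII input. Because the decomposition is tagged `<compat>`, the canonical form D leaves U+00B5 alone, while the compatibility form KD turns it into U+03BC.

```csharp
using System;
using System.Linq;
using System.Text;

static class CodePointDemo
{
    // Hypothetical helper: lists each UTF-16 code unit of s as U+XXXX.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    static void Main()
    {
        string micro = "\u00B5"; // MICRO SIGN

        Console.WriteLine(Describe(micro));                                     // U+00B5
        // Canonical decomposition ignores <compat> mappings
        Console.WriteLine(Describe(micro.Normalize(NormalizationForm.FormD)));  // U+00B5
        // Compatibility decomposition applies them
        Console.WriteLine(Describe(micro.Normalize(NormalizationForm.FormKD))); // U+03BC
    }
}
```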

So you need to normalize the characters using full compatibility decomposition, that is, normalization form KC or KD (the canonical forms C and D leave U+00B5 unchanged). Here's a quick example I wrote up to demonstrate:

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';
        char second = 'µ';

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second));                     // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}

For details on Unicode normalization and the different normalization forms, refer to System.Text.NormalizationForm and the Unicode spec.
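If the values you compare are whole strings rather than single characters, or if they end up as keys in a collection, the same idea can be wrapped in an equality comparer. This is only a sketch under that assumption. `NfkcComparer` is a made-up name, and form KC is used as one reasonable choice among the compatibility forms.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical comparer: treats strings as equal when their
// compatibility-normalized (form KC) versions are ordinally equal.
sealed class NfkcComparer : IEqualityComparer<string>
{
    static string Fold(string s)
    {
        return s.Normalize(NormalizationForm.FormKC);
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(Fold(x), Fold(y), StringComparison.Ordinal);
    }

    // The hash must be computed on the normalized text as well, so that
    // strings considered equal also produce equal hash codes.
    public int GetHashCode(string s)
    {
        return StringComparer.Ordinal.GetHashCode(Fold(s));
    }
}

static class ComparerDemo
{
    static void Main()
    {
        // Stored with GREEK SMALL LETTER MU, looked up with MICRO SIGN
        var units = new HashSet<string>(new NfkcComparer()) { "5 \u03BCm" };
        Console.WriteLine(units.Contains("5 \u00B5m")); // True

        // Compatibility forms also expand ligatures such as U+FB01 (fi)
        Console.WriteLine(new NfkcComparer().Equals("\uFB01le", "file")); // True
    }
}
```

Note that compatibility normalization folds more than µ (ligatures, full-width forms, and so on), which may or may not be what your comparison should ignore.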
