How to compare Unicode characters that "look alike"?
Problem description
I ran into a surprising issue.
I load a text file in my application, and I have some logic that compares values containing µ.
I realized that even when the texts look the same, the comparison returns false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // returns true
In the second line, the character µ was copy-pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?
In many cases, you can normalize both Unicode characters to the same normalization form before comparing them, and they should then match. Of course, which normalization form you need depends on the characters themselves; just because two characters look alike doesn't necessarily mean they represent the same character. You also need to consider whether normalization is appropriate for your use case; see Jukka K. Korpela's comment.
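Both caveats can be seen in a minimal sketch. The characters used here are not from the question; they are picked purely for illustration, and are written as \u escapes so it is unambiguous which one is meant.

```csharp
using System;
using System.Text;

class NormalizationCaveats
{
    static void Main()
    {
        string latinA = "\u0041";    // LATIN CAPITAL LETTER A
        string cyrillicA = "\u0410"; // CYRILLIC CAPITAL LETTER A, visually near-identical

        // Look-alikes from different scripts share no decomposition,
        // so they stay different even after compatibility normalization.
        Console.WriteLine(
            latinA.Normalize(NormalizationForm.FormKC) ==
            cyrillicA.Normalize(NormalizationForm.FormKC)); // False

        // Compatibility normalization can also erase distinctions you may care about:
        // SUPERSCRIPT TWO (U+00B2) becomes a plain digit 2.
        string squareMetres = "m\u00B2";
        Console.WriteLine(squareMetres.Normalize(NormalizationForm.FormKC) == "m2"); // True
    }
}
```

So normalization only helps where Unicode itself defines a mapping between the two characters, and it is lossy when the original distinction matters.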
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
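If you are unsure which code points a string actually contains, you can print them. The sketch below uses a hypothetical helper named Describe (not part of .NET) and also shows that only the compatibility forms perform this mapping; the canonical form D leaves U+00B5 untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CodePointDump
{
    // Hypothetical helper: formats each code point of s as U+XXXX, separated by spaces.
    // char.ConvertToUtf32 throws ArgumentException on an unpaired surrogate.
    public static string Describe(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            parts.Add("U+" + char.ConvertToUtf32(s, i).ToString("X4"));
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate of a pair
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        string micro = "\u00B5"; // MICRO SIGN
        Console.WriteLine(Describe(micro));                                     // U+00B5
        Console.WriteLine(Describe(micro.Normalize(NormalizationForm.FormKD))); // U+03BC
        Console.WriteLine(Describe(micro.Normalize(NormalizationForm.FormD)));  // U+00B5
    }
}
```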
So you can normalize the characters with a form that applies full compatibility decomposition: normalization form KD, or form KC, which additionally recomposes the result canonically. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';  // U+03BC GREEK SMALL LETTER MU
        char second = 'µ'; // U+00B5 MICRO SIGN

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second)); // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}
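For whole strings rather than single characters, the same idea can be wrapped in a small helper. This is only a sketch: the class name CompatComparer and the method EqualsCompat are made up for illustration, and the choice of form KC over KD is an assumption (either works for an equality test).

```csharp
using System;
using System.Text;

static class CompatComparer
{
    // Hypothetical helper: true when a and b are equal after NFKC normalization.
    // Normalize throws ArgumentException if a string contains invalid Unicode.
    public static bool EqualsCompat(string a, string b)
    {
        if (a == null || b == null) return a == b; // equal only if both are null
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(EqualsCompat("5 \u00B5m", "5 \u03BCm")); // True
        Console.WriteLine(EqualsCompat("5 \u00B5m", "5 mm"));      // False
    }
}
```

The ordinal comparison is deliberate: once both sides are in the same normalization form, a code-unit-by-code-unit comparison is enough, and the result does not depend on the current culture.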
For details on Unicode normalization and the different normalization forms, refer to System.Text.NormalizationForm and the Unicode spec.