Which characters count as the same character under a UTF-8 Unicode collation, and what VB.NET function can be used to merge them?

Question

Also, what is the VB.NET function that maps all those different characters to their most standard form?

For example, ToLower would map A and a to the same character, right?

I need the same kind of function for these characters:

German

ß === s Ü === u Χιοσ == Χίος

Otherwise, sometimes I insert Χιοσ, and later, when I insert Χίος, MySQL complains that the ID already exists.

So I want to create a unique ID that maps all those strange characters into a more stable one.

Recommended answer

For the encoding aspect of the problem, look at String.Normalize. Note also its overload that takes the particular normalization form you want to convert the string to; the default form (C) works fine for nearly everyone who wants to "map all those different characters into their most standard form".
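
What normalization to form C does can be sketched in a few lines. The sketch below uses Python's standard unicodedata module as a stand-in for the .NET String.Normalize call named above; the sample character "é" and the helper name to_nfc are illustrative assumptions, not part of the original answer.

```python
import unicodedata

# The same visible character written two ways (illustrative sample):
# one precomposed code point vs. "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def to_nfc(text: str) -> str:
    """Return the canonical composed form (Normalization Form C)."""
    return unicodedata.normalize("NFC", text)


# Compared code point by code point, the two spellings differ...
raw_equal = composed == decomposed
# ...but after normalization they are the same string.
nfc_equal = to_nfc(composed) == to_nfc(decomposed)

print(raw_equal, nfc_equal)
```

In VB.NET the corresponding step would be calling Normalize() on both strings before comparing them.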

However, things get more complicated once you move into the database and deal with collations.

Unicode normalization never changes character case. It covers only cases where the characters are essentially equivalent: they look the same [1] and mean the same thing. For example,

 Χιοσ != Χίος,

The two sigma characters (σ and the final form ς) are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) followed by the combining accent (\u0313), not to the plain iota alone.
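
These claims can be checked directly. The following is a minimal sketch, again with Python's unicodedata standing in for String.Normalize; the variable names are hypothetical, and the two spellings are copied from the question as displayed.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# The two spellings from the question.
spelling_a = "Χιοσ"
spelling_b = "Χίος"

# No normalization form makes them equal: the sigmas differ,
# and so do the plain and accented iotas.
still_different = all(
    unicodedata.normalize(form, spelling_a) != unicodedata.normalize(form, spelling_b)
    for form in FORMS
)

# The code points named in the answer.
iota_with_accent = "\u1f30"
iota_plus_mark = "\u03b9\u0313"
plain_iota = "\u03b9"

# Precomposed letter vs. letter + combining mark: canonically equivalent.
canonically_equal = (
    unicodedata.normalize("NFC", iota_with_accent)
    == unicodedata.normalize("NFC", iota_plus_mark)
)
# Accented letter vs. plain letter: not equivalent.
plain_equal = (
    unicodedata.normalize("NFC", iota_with_accent)
    == unicodedata.normalize("NFC", plain_iota)
)

print(still_different, canonically_equal, plain_equal)
```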

Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
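
A surrogate-key layout along these lines could look like the sketch below. It uses an in-memory SQLite database through Python's sqlite3 module purely as a runnable stand-in for MySQL; the table name place, the column names, and the helper add_place are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The integer id is the primary key; the Unicode name is only an attribute,
# so no collation rule can make two names collide as keys.
conn.execute(
    "CREATE TABLE place ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT NOT NULL)"
)


def add_place(name: str) -> int:
    """Insert a record and return the integer id the database generated."""
    cur = conn.execute("INSERT INTO place (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid


id_a = add_place("Χιοσ")
id_b = add_place("Χίος")
row_count = conn.execute("SELECT COUNT(*) FROM place").fetchone()[0]

print(id_a, id_b, row_count)
```

In MySQL the same idea is usually expressed with an AUTO_INCREMENT integer column as the primary key.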

It may still be useful to normalize the strings before storing them, for the purpose of searching and safer subsequent processing; but the particular case-insensitive collation that you use will no longer restrict you in any way.
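
If a looser match is wanted for searching, one possible approach is to store a separate search key next to the original string. The sketch below is an assumption on top of the answer, not something it prescribes: it decomposes the text, drops combining marks, and case-folds the result, using Python's unicodedata in place of the .NET API. The function name search_key is hypothetical.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build an accent-insensitive, case-insensitive key for lookups.

    This is deliberately lossy, so it should only ever be stored
    alongside the original string, never instead of it.
    """
    # Split accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, diaeresis, and so on).
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Case folding also maps final sigma to sigma and sharp s to "ss".
    return unicodedata.normalize("NFC", without_marks.casefold())


same_key = search_key("Χιοσ") == search_key("Χίος")
print(same_key, search_key("Ü"), search_key("ß"))
```

Note that case folding turns ß into "ss", which is not the ß === s mapping shown in the question; which of the two is wanted depends on the collation being imitated.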

[1] Almost the same in the case of compatibility normalization, as opposed to canonical normalization.
