Is there a function in VB.NET that will tell us whether two strings are equivalent under a UTF-8 Unicode collation?


Question

This question is similar to "How to emulate MySQL's utf8_general_ci collation in PHP string comparisons", but I want the function for VB.NET rather than PHP.

Recently I generated a lot of supposedly unique keys.

Some of the keys are equivalent under the UTF-8 Unicode collation.

For example, look at these two keys:

byers-street-bistro__38.15_-79.07 byers-street-bistro‎__38.15_-79.07

If I paste them into FrontPage and look at the source code, I see:

byers-street-bistro__38.15_-79.07

byers-street-bistro‎__38.15_-79.07

Note: On Stack Overflow they still look different.

I know they are not the same. I guess the difference doesn't show even on Stack Exchange. Say I have 1 million such records and I want to test whether two strings will be declared the same by the MySQL UTF-8 collation. I want to know that before uploading. How do I do that?

So VB.NET thinks those are different keys. When we build the MySQL query and upload it to the database, the database complains that they are the same key. Just one such complaint and the upload of 1 million records gets stuck.

We don't even know what the invisible character in the second key is. Where can we look that up anyway?

Anyway, I want a function that, given two strings, will tell me whether they will count as the same or not.

If possible, we want a function that converts strings into their most "standard" form.

For example, that invisible character seems to encode nothing, and the function would recognize all such "nothing" characters and eliminate them.

Is there such a thing?

So far this is what I do. I need something more comprehensive.

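Imports System.Collections
Imports System.Globalization
Imports System.Runtime.CompilerServices
Imports System.Text

' Note: the <Extension()> functions below must be declared inside a Module.

' Lazily builds the map from curly quotes to their straight ASCII equivalents.
Private Function StraightenQuotesReplacement() As Generic.Dictionary(Of String, String)
    Static replacement As Generic.Dictionary(Of String, String)
    If replacement Is Nothing Then
        replacement = New Generic.Dictionary(Of String, String)
        replacement.Add(ChrW(&H201C), """") ' left double quotation mark
        replacement.Add(ChrW(&H201D), """") ' right double quotation mark
        replacement.Add(ChrW(&H2018), "'")  ' left single quotation mark
        replacement.Add(ChrW(&H2019), "'")  ' right single quotation mark
    End If
    Return replacement
End Function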

<Extension()>
Public Function straightenQuotes(ByVal somestring As String) As String
    For Each key In StraightenQuotesReplacement.Keys
        somestring = somestring.Replace(key, StraightenQuotesReplacement.Item(key))
    Next
    Return somestring
End Function

<Extension()>
Public Function germanCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ä", "ae")
    t = t.Replace("ö", "oe")
    t = t.Replace("ü", "ue")
    t = t.Replace("Ä", "Ae")
    t = t.Replace("Ö", "Oe")
    t = t.Replace("Ü", "Ue")
    t = t.Replace("ß", "ss")
    Return t
End Function
<Extension()>
Public Function japaneseCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ヶ", "ケ")
    Return t
End Function

<Extension()>
Public Function greekCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ς", "σ")
    t = t.Replace("ι", "ί")

    Return t
End Function
<Extension()>
Public Function franceCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("œ", "oe")
    Return t
End Function

<Extension()>
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

<Extension()>
Public Function badcharacters(ByVal s As String) As String
    Dim t = s
    t = t.Replace(ChrW(8206), "")
    Return t
End Function

<Extension()>
Public Function sanitizeUTF8_Unicode(ByVal str As String) As String
    Return str.ToLower.removeDoubleSpaces.SpacetoDash.EncodeUrlLimited.straightenQuotes.RemoveDiacritics.greekCharacter.germanCharacter
End Function

Recommended answer

The keys probably use different Unicode code points for characters that look similar. For example, hyphen-minus (- U+002D), en dash (– U+2013), and em dash (— U+2014) are three different characters that all look similar: - – —

Use the AscW() function to check each character.
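A minimal sketch of that check, in case it helps: it lists each character of a key with its code point (via AscW) and its Unicode category. The module name `CodePointDump`, the helper `DescribeCharacters`, and the shortened sample key are made up for illustration. The only thing taken from the question is the character ChrW(8206), which is U+200E (LEFT-TO-RIGHT MARK).

```vbnet
Imports System
Imports System.Globalization
Imports Microsoft.VisualBasic

Module CodePointDump
    ' Hypothetical helper: one description per UTF-16 code unit,
    ' in the form "index: U+XXXX Category".
    Function DescribeCharacters(ByVal s As String) As String()
        Dim lines(s.Length - 1) As String
        For i As Integer = 0 To s.Length - 1
            Dim c As Char = s(i)
            lines(i) = String.Format("{0}: U+{1:X4} {2}", i, AscW(c),
                                     CharUnicodeInfo.GetUnicodeCategory(c))
        Next
        Return lines
    End Function

    Sub Main()
        ' Shortened sample key with the invisible character embedded (assumption:
        ' the real keys contain U+200E, as the question's ChrW(8206) suggests).
        Dim key As String = "bistro" & ChrW(&H200E) & "__38.15"
        For Each line As String In DescribeCharacters(key)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

Visible ASCII letters should report categories such as LowercaseLetter, while U+200E should report Format, which makes invisible characters easy to spot in the dump.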

As discussed in the comments below, use the System.Text.NormalizationForm enumeration (passed to String.Normalize) to determine which Unicode code points are considered to be equivalent characters.
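One way this could be put together is sketched below. It is only an approximation: .NET's normalization and comparison rules are not MySQL's collation tables, so the result can disagree with the database in edge cases, and the only authoritative check is to compare the values in MySQL itself. The names `KeyComparison`, `CanonicalKey`, and `ProbablySameKey` are hypothetical. Dropping every character in the Format category and using Form KC are assumptions about what "equivalent" should mean here.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text
Imports Microsoft.VisualBasic

Module KeyComparison
    ' Hypothetical helper: removes invisible format characters (category Cf,
    ' which includes U+200E) and applies compatibility normalization (Form KC).
    Function CanonicalKey(ByVal s As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In s
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.Format Then
                sb.Append(c)
            End If
        Next
        Return sb.ToString().Normalize(NormalizationForm.FormKC)
    End Function

    ' Hypothetical helper: culture-invariant comparison of the canonical forms,
    ' ignoring case and non-spacing marks (accents).
    Function ProbablySameKey(ByVal a As String, ByVal b As String) As Boolean
        Dim options As CompareOptions =
            CompareOptions.IgnoreCase Or CompareOptions.IgnoreNonSpace
        Return CultureInfo.InvariantCulture.CompareInfo.Compare(
            CanonicalKey(a), CanonicalKey(b), options) = 0
    End Function

    Sub Main()
        Dim plain As String = "byers-street-bistro__38.15_-79.07"
        Dim marked As String = "byers-street-bistro" & ChrW(&H200E) & "__38.15_-79.07"
        ' Ordinal equality versus the approximate collation-style check.
        Console.WriteLine(String.Equals(plain, marked, StringComparison.Ordinal))
        Console.WriteLine(ProbablySameKey(plain, marked))
    End Sub
End Module
```

For a large batch, the canonical form from `CanonicalKey` (lower-cased, if case should not matter) could also serve as a dictionary key to find likely collisions before uploading, instead of comparing every pair.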
