NVarchar,Bin排序规则,Bin2排序规则有什么区别? [英] What is the difference between NVarchar, Bin collation, Bin2 collation?

查看:96
本文介绍了NVarchar,Bin排序规则,Bin2排序规则有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所有3个选项均区分大小写和重音,并支持Unicode。
根据文档:

All 3 options are case and accent sensitive, and support Unicode. According to the documentation:


  1. NVarchar会根据相关语言或字母的词典对数据进行排序和比较(?)

  1. NVarchar sorts and compares data based on the "dictionaries for the associated language or alphabet" (?)

Bin根据位模式(?)

Bin sorts and compares data based on the "bit patterns" (?)

Bin2基于 Unicode数据的Unicode代码点(?)排序和比较数据

Bin2 sorts and compares data based on "Unicode code points for Unicode data" (?)

简单地说,我可以说Bin是NVarchar的改进,而Bin2是Bin的改进;除非我局限于向后兼容,否则始终建议使用Bin2或至少使用Bin才能获得更好的性能?

To make complex things simple, can I say that the Bin is an improvement of the NVarchar and the Bin2 is an improvement of the Bin; and unless I am restricted to backwards compatibility, it is always recommended to use Bin2 or at least Bin in order to enjoy a better performance?

======= ================================================== ===============
我将尝试再次解释自己。

========================================================================= I will try to explain my self again.

看看:

If Object_ID('words2','U') Is Not Null Drop Table words2;
Create  Table words2(word1 NVarchar(20),
                    word2 NVarchar(20) Collate Cyrillic_General_BIN,
                    word3 NVarchar(20) Collate Cyrillic_General_BIN2);

Insert
Into    words2
Values  (N'ھاوتایی',N'ھاوتایی',N'ھاوتایی'),
        (N'Συμμετρία',N'Συμμετρία',N'Συμμετρία'),
        (N'אבַּג',N'אבַּג',N'אבַּג'),
        (N'対称性',N'対称性',N'対称性');
Select * From words2;

所有3个选项都支持各种字母,无论排序规则是什么。

All 3 options support all kinds of alphabet, no matter what is the collation.

问题是-三种选择之间的实际区别是什么?假设我想用不同的字母存储私人名称,我可以使用哪个选项?我猜我将必须找到特定的名称(选择..从..哪里..),订单名称(选择..从..排序..)。

The question is- what is practical difference between the 3 options? Suppose I want to store private names in different alphabets, which option may I use? I guess I will have to find specific names (Select .. From.. Where..), order names (Select.. From.. Order By..).

推荐答案


所有3个选项均区分大小写和重音,并支持Unicode。

All 3 options are case and accent sensitive, and support Unicode.

NVARCHAR 是数据类型(例如 INT DATETIME 等。),而不是一种选择。它以UCS-2 / UTF-16(Little Endian)编码存储Unicode字符。对于U + 0000至U + FFFF(十进制值0-65535)范围,UCS-2和UTF-16是相同的代码点。 UTF-16处理U + 10000及以上的代码点(称为补充字符),所有这些代码点均定义为UCS-2范围内存在的代码点对(称为代理对)。由于两者之间的字节序列相同,因此唯一的区别在于数据的处理。意思是,内置函数在使用 not _SC 结尾的排序规则时不知道如何解释补充字符,但是它们确实可以正常工作使用以 _SC 结尾的归类时的完整UTF-16范围。 _SC 排序规则是在SQL Server 2012中添加的,但是您仍然可以在以前的版本中存储和检索补充字符;

NVARCHAR is a datatype (like INT, DATETIME, etc.) and not an option. It stores Unicode characters in the UCS-2 / UTF-16 (Little Endian) encoding. UCS-2 and UTF-16 are the identical code points for the U+0000 through U+FFFF (decimal values 0 - 65535) range. UTF-16 handles code points U+10000 and above (known as Supplementary Characters), all of which are defined as pairs of code points (known as Surrogate Pairs) that exist in the UCS-2 range. Since the byte sequences are identical between the two, the only difference is in the handling of the data. Meaning, built-in functions do not know how to interpret Supplementary Characters when using Collations that do not end in _SC, whereas they do work correctly for the full UTF-16 range when using Collations that do end in _SC. The _SC Collations were added in SQL Server 2012, but you can still store and retrieve Supplementary Characters in prior versions; it is only the built-in functions that do not behave as expected when operating on Supplementary Characters.

更直接:


  1. NVARCHAR 作为数据类型,本质上并不区分大小写或重音(或其他敏感度)。确切的行为取决于列的排序规则集,数据库的默认排序规则或 COLLATE 子句,具体取决于表达式的上下文。

  2. 虽然这是一个非常普遍的误解,但二进制排序规则既不区分大小写,也不区分变音。仅当简单地查看时,它们会显示。保持敏感状态意味着能够检测特定灵敏度的差异(大小写,重音,宽度,假名类型,并从SQL Server 2017开始:变化选择器),同时仍允许其他灵敏度和/或基础字节表示形式的差异。有关更多详细信息和示例,请参阅:不,二进制排序规则不是大小写敏感的。

  1. NVARCHAR, being a datatype, is not inherently case or accent (or any other sensitivity) sensitive or insensitive. The exact behavior depends on the collation set for the column, or the database's default collation, or the COLLATE clause, depending on the context of the expression.
  2. While it is an extremely common misconception, binary collations are neither case nor accent -sensitive. It only appears that they are when viewed simplistically. Being "sensitive" means being able to detect differences for a particular sensitivity (case, accent, width, Kana type, and starting in SQL Server 2017: variation selector) while still allowing for differences in other sensitivities and/or underlying byte representations. For more details and examples, please see: No, Binary Collations are not Case-Sensitive.

排序规则,实际上是关于字符如何在SQL中进行排序和比较的服务器还暗示了Locale / LCID(它确定了覆盖这些比较的默认处理的文化规则)和用于 VARCHAR 数据的代码页。

Collations, while literally being about how characters sort and compare to each other, in SQL Server also imply the Locale / LCID (which determines the cultural rules that override the default handling of those comparisons) and the Code Page used for VARCHAR data.

非二进制归类被认为是字典。排序/比较,因为它们考虑到归类指定的特定区域性规则(特别是相关的LCID)。另一方面,二进制排序规则不处理任何特定于区域性的规则,仅基于每个2字节序列的数值进行排序和比较。因此,二进制排序规则要快得多,因为它们不需要应用大量规则,但是它们也无法知道单个两字节代码点 u 与2个两个字节的序列不同,这两个序列是 u ,另外一个重音符号将在屏幕上呈现为与单个两个字节代码点,在使用非二进制排序规则时将比较为相等。

Non-binary collations are considered "dictionary" sorting / comparisons because they take into account the rules of the particular culture specified by the Collation (specifically the associated LCID). On the other hand, binary collations do not deal with any culture-specific rules and only sort and compare based on the numeric value of each 2-byte sequence. For this reason binary collations are much faster because they don't need to apply a large list of rules, but they also have no way to know that single two-byte Code Point that is a u with an accent is not the same as 2 two-byte sequences which are a u and a separate accent that will render on screen the same as the single two-byte code point, and will compare as being equal when using a non-binary collation.

_BIN 与<$ c之间的差$ c> _BIN2 排序准确性,不是性能。较旧的 _BIN 归类进行逐字节的简单排序和比较(在第一个字符之后,该字符被视为代码点,而不是两个字节,因此 排序正确),而较新的 _BIN2 排序规则(从SQL Server 2005开始)将每个单元代码进行比较。 (补充字符由两个代码单元组成,并且 _BIN2 归类将每个代码单元单独查看,而不是将它们的组合视为一个代码点)。这两种方法之间的排序顺序有所不同,主要是由于SQL Server是 Little Endian。在其中存储字节(对于单个实体:UTF-16代码单元, INT 值, BIGINT 值等)相反的顺序。因此,当使用 _BIN 归类时,代码点U + 0206实际上将在之后 U + 0402排序:

The difference between _BIN and _BIN2 is sorting accuracy, not performance. The older _BIN collations do a simplistic byte-by-byte sorting and comparison (after the first character, which is seen as a code point and not two bytes, thus it sorts correctly) whereas the newer _BIN2 collations (starting in SQL Server 2005) compare each Code "Unit" (Supplementary Characters are made up of two Code Units, and _BIN2 collations see each Code Unit individually instead of seeing the combination of them as a Code Point). There is a difference in sort order between these two approaches mainly due to SQL Server being "Little Endian" which stores bytes (for a single entity: UTF-16 code unit, INT value, BIGINT value, etc) in reverse order. Hence, code point U+0206 will actually sort after U+0402 when using a _BIN collation:

SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM  (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN;
/*
ID    Thing    ThingBytes
2     aЂ       0x61000204
1     aȆ       0x61000602  <-- U+0206, stored as 0x06 then 0x02, should sort first
*/

SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM  (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN2;
/*
ID    Thing    ThingBytes
1     aȆ       0x61000602
2     aЂ       0x61000204
*/

有关此区别的更多详细信息和示例,请参阅:各种二进制排序规则(文化,版本以及BIN与BIN2之间的差异)

For more details and examples of this distinction, please see: Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2).

此外,对于Unicode / NVARCHAR 数据,所有二进制排序规则都以完全相同的方式进行排序和比较。代码点是数值,比较时没有语言/文化差异。因此,具有多于一个的全局二进制的唯一目的就是为了实现。排序规则是仍然需要指定用于 VARCHAR 数据的代码页。

Also, all binary collations sort and compare in exactly the same manner when it comes to Unicode / NVARCHAR data. Code Points are numerical values and there are no linguistic / cultural variations to consider when comparing them. Hence the only purpose in having more than a single, global "BINARY" Collation is the need to still specify the Code Page to use for VARCHAR data.


假设我想要用不同的字母存储私人名称,我可以使用哪个选项?

Suppose I want to store private names in different alphabets, which option may I use?

如果您使用的是 VARCHAR 字段,然后特定于排序规则(无论是二进制还是非二进制)将确定哪些字符可用,因为这是8位扩展ASCII,通常具有256个不同字符的范围(除非使用双字节字符集,在哪种情况下,它可以处理更多内容,但大多数情况下仍然是单一文化/字母)。如果使用 NVARCHAR 存储数据,因为它是Unicode,所以它具有单个字符集,其中包括来自所有语言的所有字符,以及许多其他内容。

If you were using VARCHAR fields, then the Collation specific (regardless of binary or non-binary) would determine which characters are available since that is 8-bit Extended ASCII which typically has a range of 256 different characters (unless using a Double-Byte Character Set, in which case it can handle many more, but those are still mostly of a single culture / alphabet). If using NVARCHAR to store the data, since that is Unicode it has a single character set comprised of all characters from all languages, plus lots of other stuff.

因此,选择 NVARCHAR 可以解决以下问题:可以保留来自各种语言的名称的正确字符。但是,您仍然需要选择特定的文化词典规则,以便以每种特定文化期望的方式进行排序。这是一个问题,因为无法动态设置排序规则。因此,选择最常用的一种。二进制排序规则在这里无济于事,实际上会与您尝试执行的操作背道而驰。但是,当您需要区分在其他情况下等价的字符时,它们非常方便,例如在这种情况下: SQL Server过滤CJK标点符号字符(这里是SO)。

So choosing NVARCHAR takes care of the problem of being able to hold the proper characters of names coming from various languages. HOWEVER, you still need to pick a particular cultures dictionary rules in order to sort in a manner that each particular culture expects. This is a problem because Collations cannot be set dynamically. So pick the one that is used the most. Binary collations will not help you here, and in fact would go against what you are trying to do. They are, however, quite handy when you need to distinguish between characters that would otherwise equate, such as in this case: SQL server filtering CJK punctuation characters (here on S.O.).

另一个与我使用 _BIN2 排序规则相关的场景是检测大小写更改在网址中。 URL的某些部分不区分大小写,例如主机名/域名。但是,在QueryString中,传入的值可能是敏感的。如果在不区分大小写的操作中比较URL值,则 http://domain.tld/page.ext?var1 = val 将等于 http://domain.tld/page.ext?var1 = VAL ,并且这些值不应假定为相同。使用区分大小写的排序规则通常也可以正常工作,但是我使用 Latin1_General_100_BIN2 ,因为它速度更快(没有语言规则),并且不会忽略更改ü u +组合音调(呈现为ü )。

Another related scenario in which I have used a _BIN2 collation was detecting case changes in URLs. Some parts of a URL are case-insensitive, such as the hostname / domain name. But, in the QueryString, the values being passed in are potentially sensitive. If you compare URL values in a case-insensitive operation, then http://domain.tld/page.ext?var1=val would equate to http://domain.tld/page.ext?var1=VAL, and those values should not be assumed to be the same. Using a case-sensitive Collation would also typically work, but I use Latin1_General_100_BIN2 because it's faster (no linguistic rules) and would not ignore a change of ü to u + combining diaeresis (which renders as ).

我对以下规则的排序规则有更多的解释(因此,此处不再重复,因为其中大多数包含几个示例):

I have more explanations of Collations spread across the following answers (so won't duplicate here as most of them contain several examples):

  • UCS-2 and SQL Server
  • SQL Server default character encoding
  • What is the point of COLLATIONS for nvarchar (Unicode) columns?
  • Unicode to Non-unicode conversion
  • NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server

这些在DBA.StackExchange上:

And these are on DBA.StackExchange:

  • How To Strip Hebrew Accent Marks
  • Latin1_General_BIN performance impact when changing the database default collation
  • Storing Japanese characters in a table

有关使用归类,编码,Unicode等的更多信息,请访问:归类信息

For more info on working with Collations, Encodings, Unicode, etc, please visit: Collations Info

这篇关于NVarchar,Bin排序规则,Bin2排序规则有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆