Choosing a binary collation that can differentiate between 'ss' and 'ß' for an nvarchar column in SQL Server


Problem description


As the default SQL_Latin1_General_CP1_CI_AS collation of SQL Server can't differentiate between ss and ß, I want to change the collation of a specific column in a table to SQL_Latin1_General_CP437_BIN2, as advised here.

However, I am not sure whether this is generally a good practice or not. Also I am not sure about the implications other than the following:

  • Changing the sort order: As I never sort the data by this column, it might not be a problem for me. However, if you think otherwise, please let me know.
  • Changing case-insensitivity to case-sensitivity: As my application always provides text in lowercase, I think this change will also not be a problem for me. However, if you think otherwise, please let me know.

I am curious about the other major implications of this change, if any.

Additionally, I would like to know which one of the following would be the most suitable for this scenario:

SQL_Latin1_General_CP437_BIN

Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data


SQL_Latin1_General_CP437_BIN2

Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 30 on Code Page 437 for non-Unicode Data


SQL_Latin1_General_CP850_BIN

Description: Latin1-General, binary sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data


SQL_Latin1_General_CP850_BIN2

Description: Latin1-General, binary code point comparison sort for Unicode Data, SQL Server Sort Order 40 on Code Page 850 for non-Unicode Data
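
For reference, descriptions like the ones above can be read from the server itself, together with the code page each collation uses for non-Unicode data. This is only a minimal sketch: it relies on the built-in sys.fn_helpcollations() and COLLATIONPROPERTY functions, and the filter list is simply the four candidates plus the current default.

```sql
-- List the candidate collations with their descriptions and the code page
-- each one uses for VARCHAR (non-Unicode) data.
SELECT  col.[name],
        col.[description],
        COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage]
FROM    sys.fn_helpcollations() col
WHERE   col.[name] IN (N'SQL_Latin1_General_CP1_CI_AS',
                       N'SQL_Latin1_General_CP437_BIN',
                       N'SQL_Latin1_General_CP437_BIN2',
                       N'SQL_Latin1_General_CP850_BIN',
                       N'SQL_Latin1_General_CP850_BIN2')
ORDER BY col.[name];
```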

If you think that there are other collations better suited for this scenario, please mention those as well.


Update on 19.03.2017: To anyone coming to this question:

  • Be sure to check the answers from both @srutzky and @SqlZim, as well as the resources they refer to. You don't want to rush into things in this case.
  • As changing a collation is not for the faint-hearted :P, keeping a backup of the table data might come in handy.
  • Also check the dependencies on the column, such as indexes and constraints; you may need to drop and recreate them, as was the case for me.

Have fun :)

Solution

A few things about Collations:

  1. The SQL_ Collations were deprecated as of SQL Server 2000 (yes, 2000). If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).

    The issue with the SQL_ Collations is really only related to VARCHAR (i.e. non-Unicode) data as NVARCHAR (i.e. Unicode) data uses the rules from the OS. But the rules for sorting and comparison for VARCHAR data, unfortunately, use a simple mapping and do not include the more complex linguistic rules. This is why ss and ß do not equate when stored as VARCHAR using the same SQL_Latin1_General_CP1_CI_AS Collation. These deprecated Collations also are not able to give a lower weight to dashes when used in the middle of a word. The non-SQL_ Collations (i.e. Windows Collations) use the same rules for both VARCHAR and NVARCHAR so the VARCHAR handling is more robust, more consistent with NVARCHAR.

  2. The _BIN Collations were deprecated as of SQL Server 2005. If you can avoid using them, you should (but that doesn't mean go changing a bunch of things if there is no pressing need to!).

    The issue with the _BIN Collations is rather subtle as it only affects sorting. Comparisons are the same between _BIN and _BIN2 Collations due to them being compared at the byte level (hence no linguistic rules). BUT, due to SQL Server (and Windows / PCs) being Little Endian, entities are stored in reverse byte order. This becomes apparent when dealing with double-byte "characters", which is what NVARCHAR data is: UTF-16 Little Endian. This means that Unicode Code Point U+1216 has a hex/binary representation of 0x1216 on Big Endian systems, but is stored as 0x1612 on Little Endian systems. To come full circle so that the importance of this last point will (hopefully) become obvious: the _BIN Collations will compare byte by byte (after the first character) and hence see U+1216 as being 0x16 and then 0x12, while the _BIN2 Collations will compare code point by code point and hence see U+1216 as being 0x12 and then 0x16.

  3. This particular column is NVARCHAR (a VARCHAR column using SQL_Latin1_General_CP1_CI_AS would not equate ss and ß) and so for just this column alone, there is no difference between SQL_Latin1_General_CP437_BIN2 and SQL_Latin1_General_CP850_BIN2 due to Unicode being a single, all-inclusive character set.

  4. For VARCHAR data, there would be a difference since they are different code pages (437 and 850), and both of those are different than the one that you are using now (CP1 == code page 1252).

  5. While using a binary Collation is often overkill, in this case it might be necessary given that there is only one locale / culture that does not equate ß with ss: Hungarian. Using a Hungarian Collation might have some linguistic rules that you don't want (or at least wouldn't expect), so the binary Collation seems to be the better choice here (just not any of the 4 you are asking about :-). Just keep in mind that by using a binary Collation, not only are you giving up all linguistic rules, but you also lose the ability to equate different versions of the same character, such as A (Latin Capital Letter A U+0041) and Ａ (Fullwidth Latin Capital Letter A U+FF21).

    Use the following query to see what Collations are non-binary and do not equate these characters:

    DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1;';
    
    SELECT @SQL += REPLACE(N'
      IF(N''ß'' COLLATE {Name} <> N''ss'' COLLATE {Name})
      BEGIN
        RAISERROR(N''%4d.  {Name}'', 10, 1, @Counter) WITH NOWAIT;
        SET @Counter += 1;
      END;
    ', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
    FROM   sys.fn_helpcollations() col
    WHERE  col.[name] NOT LIKE N'SQL[_]%'
    AND    col.[name] NOT LIKE N'%[_]BIN%'
    ORDER BY col.[name]
    
    --PRINT @SQL;
    EXEC (@SQL);
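
The points in the list above can be checked with a few throwaway queries. This is a sketch rather than a definitive test: the three collations are just representatives (the current SQL_ default, a Windows collation, and the suggested binary one), and the column aliases are made up for readability. A binary collation should always report 'different', since the underlying bytes differ; going by the explanation above, the two non-binary collations are expected to report 'equal' for NVARCHAR data.

```sql
-- 1) ß vs. ss as NVARCHAR under three representative collations.
SELECT
    CASE WHEN N'ß' = N'ss' COLLATE SQL_Latin1_General_CP1_CI_AS
         THEN 'equal' ELSE 'different' END AS [SQL_CP1_CI_AS],
    CASE WHEN N'ß' = N'ss' COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS [Windows_100_CI_AS],
    CASE WHEN N'ß' = N'ss' COLLATE Latin1_General_100_BIN2
         THEN 'equal' ELSE 'different' END AS [BIN2];

-- 2) What a binary collation gives up: A (U+0041) vs. fullwidth A (U+FF21).
--    NCHAR(65313) is U+FF21; it is built from its code point so the script
--    does not depend on the file encoding.
SELECT
    CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'different' END AS [Windows_100_CI_AS],
    CASE WHEN N'A' = NCHAR(65313) COLLATE Latin1_General_100_BIN2
         THEN 'equal' ELSE 'different' END AS [BIN2];

-- 3) Little Endian storage: code point U+1216 (decimal 4630) is stored
--    with its bytes swapped, i.e. as 0x1612.
SELECT CONVERT(VARBINARY(2), NCHAR(4630)) AS [StoredBytes];
```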
    

So:

  • If you are going to use a binary Collation, use something like Latin1_General_100_BIN2.
  • You do not need to change the Collation of the entire DB and all of its tables. That is a lot of work, and the only "built-in" mechanism to do it is undocumented (i.e. unsupported).
  • If you were to change the Database's default Collation, that affects name resolution of Database-scoped items such as tables, columns, indexes, functions, stored procedures, etc. Meaning: you would need to regression-test 100% of the application that touches the database, as well as all SQL Server Agent jobs, etc. that touch this database.
  • If most / all of the queries that use this column need ß and ss to be seen as different, then go ahead and alter the column to use Latin1_General_100_BIN2. This will likely require dropping the following dependent objects and then recreating them after the ALTER TABLE:

    • Indexes
    • Unique Constraints
    • Foreign Key Constraints

    HINT: Be sure to check the current NULL / NOT NULL setting of the column and specify that in the ALTER TABLE ... ALTER COLUMN ... statement so that it does not get changed.

  • If only some queries need this different behavior, then override just those comparison operations with the COLLATE clause, on a per-condition basis (e.g. WHERE tab.[ThisColumn] LIKE N'%ss%' COLLATE Latin1_General_100_BIN2). The COLLATE keyword should only be needed on one side (of the operator) as Collation Precedence will apply it to the other side.
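
The last two options can be sketched as follows. The names and the definition are hypothetical: dbo.MyTable, ThisColumn, and NVARCHAR(100) NOT NULL stand in for the real table, column, data type, and nullability, which should be copied from the existing definition.

```sql
-- Check the current definition first so that length and nullability can be
-- repeated unchanged. Note: max_length is in bytes (200 for NVARCHAR(100)).
SELECT  c.[name], c.collation_name, c.max_length, c.is_nullable
FROM    sys.columns c
WHERE   c.[object_id] = OBJECT_ID(N'dbo.MyTable')
AND     c.[name] = N'ThisColumn';

-- Option A: change the column itself.
-- Drop dependent indexes, unique constraints, and foreign keys before this
-- statement, and recreate them afterwards.
ALTER TABLE dbo.MyTable
    ALTER COLUMN ThisColumn NVARCHAR(100)
    COLLATE Latin1_General_100_BIN2 NOT NULL;

-- Option B: leave the column alone and override the collation only in the
-- queries that need ß and ss to be treated as different.
SELECT  tab.*
FROM    dbo.MyTable tab
WHERE   tab.[ThisColumn] LIKE N'%ss%' COLLATE Latin1_General_100_BIN2;
```

Option B avoids touching dependent objects, but a predicate that overrides the column's collation generally cannot use an index seek on that column, so it suits occasional queries better than hot paths.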

For more info on working with strings and collations, please visit: Collations Info
