UCS-2 and SQL Server

Question

While researching options for storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

However, Microsoft chose UCS-2 for reasons that I don't fully understand, which is causing me to second-guess that leaning. The documentation for SQL Server 2012 does show how to create a UTF-8 UDT, but the decision for UCS-2 presumably pervades SQL Server.

Wikipedia (which interestingly notes that UCS-2 is obsolete in favor of UTF-16) notes that UTF-8 is a variable-width character set capable of encoding any Unicode data point and that it provides the de facto standard encoding for interchange of Unicode text. So, it feels like any Unicode character can be represented in UTF-8, and since most text will be English, the representation will be nearly twice as compact as with UCS-2 (I know disk is "cheap", but disk cache isn't, and memory isn't in comparison to the data sizes I'm dealing with. Many operations degrade exponentially when the working set is larger than available RAM).

What problems might I encounter by swimming up the UCS-2 stream?

Recommended answer

storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

Unlike some other RDBMSs that allow for choosing an encoding, SQL Server stores Unicode data only in UTF-16 (Little Endian), and non-Unicode data in an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC) for whatever Code Page is implied by the Collation of the field.

Microsoft chose UCS-2 for reasons that I don't fully understand

Their decision to choose UCS-2 makes sense enough given that UTF-16 was introduced in mid-1996 and fully specified in 2000. A lot of other systems use (or used) it as well (please see: https://en.wikipedia.org/wiki/UTF-16#Usage). Their decision to continue with it might be more questionable, though it is probably due to Windows and .NET being UTF-16. The physical layout of the bytes is the same between UCS-2 and UTF-16, so upgrading systems from UCS-2 to support UTF-16 should be purely functional with no need to alter any existing data.
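
The point about identical byte layout can be checked with a small sketch. Python has no built-in UCS-2 codec, so `ucs2_le_bytes` below is a hypothetical helper written for illustration only: it packs one 16-bit unit per character and rejects anything outside the Basic Multilingual Plane. For BMP-only text, its output should match Python's UTF-16 (Little Endian) encoding byte for byte.

```python
import struct


def ucs2_le_bytes(text: str) -> bytes:
    """Encode text as fixed-width UCS-2 (little endian), one 16-bit unit per character.

    Hypothetical helper for illustration: raises ValueError for characters
    outside the Basic Multilingual Plane, which UCS-2 cannot represent.
    """
    units = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError(f"U+{code_point:04X} is outside the BMP")
        units.append(code_point)
    return struct.pack(f"<{len(units)}H", *units)


sample = "Unicode text, \u65e5\u672c\u8a9e"  # ASCII plus three CJK characters
same_layout = ucs2_le_bytes(sample) == sample.encode("utf-16-le")
print(same_layout)
```

The two encodings only diverge for supplementary characters, which this helper refuses and UTF-16 writes as a pair of 16-bit units.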

The documentation for SQL Server 2012 does show how to create a UTF-8 UDT,

Um, no. Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement of any native type. It is very handy for creating something to handle specialized data. But strings, even of a different encoding, are far from specialized. Going this route for your string data would destroy any amount of usability of your system, not to mention performance, as you wouldn't be able to use any built-in string functions. If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to a VARBINARY. So in order to do any string comparison or sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all other values, one by one, back to UTF-8 strings and then do the string comparison that can account for linguistic differences.
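
To illustrate the difference between an ordinal comparison and a linguistic one, here is a minimal sketch. It is not how SQL Server collations work internally: `linguistic_key` is a hypothetical, deliberately crude stand-in that strips accents and ignores case, just to show that ordering by raw UTF-8 bytes (which is all a VARBINARY value offers) gives a different result from ordering by decoded text.

```python
import unicodedata

names = ["apple", "Zebra", "\u00e9clair", "Banana"]

# Ordinal comparison: order by the raw UTF-8 bytes, as serialized binary data would.
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))


def linguistic_key(s: str) -> str:
    """Crude stand-in for a linguistic collation: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Accent- and case-insensitive ordering requires working on decoded strings.
linguistic_order = sorted(names, key=linguistic_key)

print(binary_order)
print(linguistic_order)
```

In the byte ordering, every uppercase ASCII letter sorts before every lowercase one, and the accented word lands last; the decoded ordering is alphabetical.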

Also, that "documentation" is really just sample code / proof of concept stuff. The code was written in 2003 ( http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/CS/UTF8String/Utf8String.cs ) for SQL Server 2005. I saw a script to test functionality, but nothing involving performance.

but the decision for UCS-2 presumably pervades SQL Server.

Yes, very much so. By default, the handling of the built-in functions is only for UCS-2. But starting in SQL Server 2012, you can get them to handle the full UTF-16 character set (well, as of Unicode Version 5 or 6, depending on your OS and version of the .NET Framework) by using one of the collations that has a name ending in _SC (i.e. Supplementary Characters).
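
The practical effect can be sketched outside of SQL Server. The snippet below does not call SQL Server at all; it only models the two ways of counting "characters", on the assumption that built-in functions such as LEN count 16-bit code units under a non-_SC collation and whole code points under an _SC collation. The function names are made up for this illustration.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units in the UTF-16 encoding (the UCS-2 style view)."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points (the supplementary-character-aware view)."""
    return len(text)


value = "a\U0001F600"  # one BMP letter followed by one supplementary character

print(utf16_code_units(value))  # counts the surrogate pair as two units
print(code_points(value))       # counts the supplementary character once
```

For text that stays within the Basic Multilingual Plane the two counts are identical, which is why the distinction often goes unnoticed.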

Wikipedia ... notes that UCS-2 is obsolete in favor of UTF-16

Correct. UTF-16 and UCS-2 both use 2-byte code units. But UTF-16 uses some of them in pairs (i.e. Surrogate Pairs) to map additional characters. The code points used for these pairs are reserved for this purpose in UCS-2 and hence are not used to map to any usable symbols. This is why you can store any Unicode character in SQL Server and it will be stored and retrieved correctly.
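
As a small sketch of how a surrogate pair is formed, the helper below (a name chosen for this example) applies the standard UTF-16 arithmetic to a supplementary code point and compares the result with what Python's own UTF-16 encoder writes. Both halves fall in the reserved range U+D800 to U+DFFF, and the bytes decode back to the original character.

```python
import struct


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


emoji = "\U0001F600"
pair = to_surrogate_pair(ord(emoji))
encoded = emoji.encode("utf-16-le")

print([hex(unit) for unit in pair])
print(struct.unpack("<2H", encoded) == pair)
print(encoded.decode("utf-16-le") == emoji)
```

A system that treats the data as UCS-2 sees two reserved 16-bit values it does not interpret, but it still stores and returns the same four bytes.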

Wikipedia ... notes that UTF-8 is a variable-width character set capable of encoding any Unicode data point

Correct, though misleading. Yes, UTF-8 is variable-width, but UTF-16 is also minorly variable since all of the Supplementary Characters are composed of two double-byte code units. Hence UTF-16 uses either 2 or 4 bytes per symbol, though UCS-2 is always 2 bytes. But that is not the misleading part. What is misleading is the implication that other Unicode encodings aren't capable of encoding all code points. While UCS-2 can hold them but not interpret them, UTF-16 and UTF-32 can both map all Unicode code points, just like UTF-8.
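
A quick way to see both points, variable width and full coverage, is to encode a few characters in all three encodings. This is a minimal sketch using Python's standard codecs; the sample characters are arbitrary choices for illustration.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

samples = {
    "Latin letter": "A",
    "accented letter": "\u00e9",
    "CJK ideograph": "\u65e5",
    "supplementary character": "\U0001F600",
}


def byte_widths(ch: str) -> dict[str, int]:
    """Bytes needed to store one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in samples.items():
    widths = byte_widths(ch)
    # Every encoding must decode back to the same character.
    assert all(ch.encode(enc).decode(enc) == ch for enc in ENCODINGS)
    print(f"{label:24} U+{ord(ch):04X} {widths}")
```

UTF-8 ranges from 1 to 4 bytes per character, UTF-16 uses 2 or 4, and UTF-32 is always 4; all three round-trip every sample.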

and that it [ed: UTF-8] provides the de facto standard encoding for interchange of Unicode text.

This may be true, but it is entirely irrelevant from an operational perspective.

it feels like any Unicode character can be represented in UTF-8

Again, true, but entirely irrelevant since UTF-16 and UTF-32 also map all Unicode code points.

since most text will be English, the representation will be nearly twice as compact as with UCS-2

Depending on circumstances this could very well be true, and you are correct to be concerned about such wasteful usage. However, as I mentioned in the question that led to this one ( UTF-8 Support, SQL Server 2012 and the UTF8String UDT ), you have a few options to mitigate the amount of space wasted if most rows can fit into VARCHAR yet some need to be NVARCHAR. The best option is to enable ROW COMPRESSION or PAGE COMPRESSION (Enterprise Edition only!). Starting in SQL Server 2008 R2, they allow non-MAX NVARCHAR fields to use the "Standard Compression Scheme for Unicode", which is at least as good as UTF-8 and in some cases even better. NVARCHAR(MAX) fields cannot use this fancy compression, but their IN ROW data can benefit from regular ROW and/or PAGE Compression. Please see the following for a description of this compression and a chart comparing data sizes for: raw UCS-2 / UTF-16, UTF-8, and UCS-2 / UTF-16 with data compression enabled.
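
To put rough numbers on the uncompressed case only, here is a small sketch that measures the same text in UTF-8 and UTF-16. It says nothing about SQL Server's ROW or PAGE compression, which it does not model; the sample strings and the `encoded_sizes` helper are made up for the illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Raw byte counts for the same text in UTF-8 and UTF-16 (no compression)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = "The quick brown fox jumps over the lazy dog. " * 100
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8" * 100  # 8 BMP characters

for label, text in (("english", english), ("japanese", japanese)):
    sizes = encoded_sizes(text)
    ratio = sizes["utf-16"] / sizes["utf-8"]
    print(f"{label:9} {sizes}  utf-16/utf-8 = {ratio:.2f}")
```

Pure ASCII text is exactly twice as large in UTF-16, while text made of CJK characters is larger in UTF-8 (3 bytes per character versus 2), so the saving depends entirely on how much of the data really is English.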

Please also see the MSDN page for Data Compression for more details, as there are some restrictions (beyond it being available only in Enterprise Edition, though it was made available to all editions starting with SQL Server 2016 SP1) and some circumstances in which compression might make things worse.

I know disk is "cheap"

The veracity of that statement depends on how one defines "disk". If you are speaking in terms of commodity parts that you can purchase off the shelf at a store for use in your desktop / laptop, then sure. But if you are speaking in terms of enterprise-level storage that will be used for your Production systems, then have fun explaining to whoever controls the budget that they shouldn't reject the million-plus-dollar SAN that you want because it is "cheap" ;-).

What problems might I encounter by swimming up the UCS-2 stream?

None that I can think of. Well, as long as you don't follow any horrible advice to do something like implementing that UDT, or converting all of the strings to VARBINARY, or using NVARCHAR(MAX) for all string fields ;-). But of all of the things you could worry about, SQL Server using UCS-2 / UTF-16 shouldn't be one of them.

But, if for some reason this issue of no native support for UTF-8 is super important, then you might need to find another RDBMS to use that does allow for UTF-8.

Update 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
