Unicode to Non-unicode conversion


Problem description

I have some Unicode characters in an NVarchar field named PostalCode. When I convert them to Varchar, there is a ? in the result.

My code is:

select PostalCode, cast(PostalCode as varchar) as val from table

and the result is:

PostalCode       |   val
053000           | 053000?

Here I am getting a '?' in the result. Is there any way to remove such special characters?

Solution

There are a few things to note here:

  1. If you want to see exactly which character is there, convert the value to VARBINARY. That gives you the hex value of every character in the string, and there is no concept of "hidden" characters in hex:

    DECLARE @PostalCode NVARCHAR(20);
    SET @PostalCode = N'053000'+ NCHAR(0x2008); -- 0x2008 = "Punctuation Space"
    SELECT @PostalCode AS [NVarCharValue],
           CONVERT(VARCHAR(20), @PostalCode) AS [VarCharValue],
           CONVERT(VARCHAR(20), RTRIM(@PostalCode)) AS [RTrimmedVarCharValue],
           CONVERT(VARBINARY(20), @PostalCode) AS [VarBinaryValue];
    

    Returns:

    NVarCharValue   VarCharValue   RTrimmedVarCharValue   VarBinaryValue
    053000          053000?        053000?                0x3000350033003000300030000820
    

    NVARCHAR data is stored as UTF-16, which works in 2-byte units. Looking at the last 4 hex digits to see what the hidden 2-byte unit is, we see "0820". Since Windows and SQL Server use UTF-16 Little Endian (i.e. UTF-16LE), the bytes within each unit are in reverse order. Flipping the final 2 bytes, 08 and 20, we get "2008", which is the "Punctuation Space" that we added via NCHAR(0x2008).

    Also, please note that RTRIM did not help at all here.

  2. Simplistically, you can just replace the question marks with nothing:

    SELECT REPLACE(CONVERT(VARCHAR(20), [PostalCode]), '?', '');
    

  3. More importantly, you should convert the [PostalCode] column to VARCHAR so that it cannot store these characters. As far as I have ever read (see bottom section for references), no country uses postal-code characters that fall outside the ASCII character set, so every valid value fits the VARCHAR datatype. In fact, what is allowed is a rather small subset of ASCII, which means you can easily filter on the way in (or just do the same REPLACE as shown above when inserting or updating):

    ALTER TABLE [table] ALTER COLUMN [PostalCode] VARCHAR(20) NULL; -- or NOT NULL, matching the column's current setting
    

    Be sure to check the column's current NULL / NOT NULL setting and state the same one in the ALTER statement above. If you leave it out, the setting could change, because NULL is the default when nothing is specified.

  4. If you cannot change the schema of the table and need to do a periodic "cleansing" of the bad data, you can run the following:

    ;WITH cte AS
    (
       SELECT *
       FROM   TableName
       WHERE  [PostalCode] <>
                      CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), [PostalCode]))
    )
    UPDATE cte
    SET    cte.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), [PostalCode]), '?', '');
    

    Please keep in mind that the above query is not designed to run efficiently on a table with millions of rows. At that scale, the update would need to be handled in smaller sets via a loop.
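The byte-level reasoning in item 1 and the round-trip comparison used in item 4 can also be checked outside SQL Server. The following is a minimal Python sketch, not SQL Server's actual conversion routine. It assumes the VARCHAR column uses Windows code page 1252, which is only a stand-in: the real code page depends on the column's collation. All function names here are hypothetical helpers introduced for illustration.

```python
import unicodedata

# VarBinaryValue from the query output above, without the leading "0x".
HEX_FROM_SQL = "3000350033003000300030000820"


def decode_nvarchar(hex_digits: str) -> str:
    """Interpret raw NVARCHAR bytes as UTF-16 Little Endian text."""
    return bytes.fromhex(hex_digits).decode("utf-16-le")


def describe(value: str) -> list[tuple[str, str]]:
    """List each character as (code point, Unicode name) so nothing stays hidden."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in value]


def to_varchar(value: str, code_page: str = "cp1252") -> str:
    """Approximate a lossy NVARCHAR -> VARCHAR conversion.

    Assumption: the target code page is cp1252. Characters that the code
    page cannot represent are replaced with '?'.
    """
    return value.encode(code_page, errors="replace").decode(code_page)


def is_lossy(value: str, code_page: str = "cp1252") -> bool:
    """Round-trip test: True when the conversion changes the value."""
    return to_varchar(value, code_page) != value


if __name__ == "__main__":
    decoded = decode_nvarchar(HEX_FROM_SQL)
    for code_point, name in describe(decoded):
        print(code_point, name)
    print(repr(to_varchar(decoded)))
    print(is_lossy(decoded), is_lossy("053000"))
```

The last entry reported by `describe` is the trailing U+2008 character, and `is_lossy` plays the same role as the WHERE clause in item 4: a value qualifies for cleanup only when converting it changes it.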


For reference, here is the Wikipedia article for Postal code, which currently states that the only characters ever used are:

  • The Arabic numerals "0" to "9"
  • Letters of the ISO basic Latin alphabet
  • Spaces, hyphens

And regarding the max size of the field, here is the Wikipedia List of postal codes.
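Because the allowed character set listed above is so small, filtering "on the way in" can be done with a whitelist instead of replacing question marks after the fact. Below is a minimal Python sketch of that idea for application code that validates values before they reach the database. The function name and the decision to trim leftover leading and trailing spaces are assumptions made for illustration, not part of the original answer.

```python
import re

# Anything that is NOT an ASCII digit, an ISO basic Latin letter,
# a plain space, or a hyphen is treated as disallowed.
# Explicit ranges are used because \d and \w also match non-ASCII characters.
_DISALLOWED = re.compile(r"[^0-9A-Za-z \-]")


def clean_postal_code(value: str) -> str:
    """Drop disallowed characters, then trim leftover outer spaces."""
    return _DISALLOWED.sub("", value).strip()


if __name__ == "__main__":
    for raw in ["053000\u2008", "SW1A 1AA", "K1A-0B1", " 12345 "]:
        print(repr(raw), "->", repr(clean_postal_code(raw)))
```

Unlike REPLACE(..., '?', ''), a whitelist removes the offending character itself, so it works before any lossy conversion has turned the character into a question mark.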
