Unicode to Non-unicode conversion
Problem description
I have some Unicode characters in an NVARCHAR field named PostalCode. When I convert them to VARCHAR, there is a ? in the result.
My code is:
select PostalCode, cast(PostalCode as varchar) as val from [table]
and the result is:
PostalCode | val
053000 | 053000?
Here I am getting a '?' in the result. Is there any way to remove such special characters?
Solution
There are a few things to note here:
If you want to see exactly which character is there, you can convert the value to VARBINARY, which will give you the hex / binary value of all characters in the string; there is no concept of "hidden" characters in hex:
DECLARE @PostalCode NVARCHAR(20);
SET @PostalCode = N'053000' + NCHAR(0x2008); -- 0x2008 = "Punctuation Space"
SELECT @PostalCode AS [NVarCharValue],
       CONVERT(VARCHAR(20), @PostalCode) AS [VarCharValue],
       CONVERT(VARCHAR(20), RTRIM(@PostalCode)) AS [RTrimmedVarCharValue],
       CONVERT(VARBINARY(20), @PostalCode) AS [VarBinaryValue];
Returns:
NVarCharValue  VarCharValue  RTrimmedVarCharValue  VarBinaryValue
053000         053000?       053000?               0x3000350033003000300030000820
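The trailing character can also be identified without reading the hex by eye. The following is a minimal sketch that reuses the same sample value and asks SQL Server for the last character's code point and raw bytes; it assumes the offending character is the final one in the string.

```sql
DECLARE @PostalCode NVARCHAR(20);
SET @PostalCode = N'053000' + NCHAR(0x2008); -- same sample value as above

-- UNICODE() returns the integer code point of the first character it is given.
-- RIGHT(..., 1) isolates the last character of the string.
SELECT UNICODE(RIGHT(@PostalCode, 1)) AS [LastCodePoint],                 -- 8200 decimal = 0x2008
       CONVERT(VARBINARY(2), RIGHT(@PostalCode, 1)) AS [LastCharBytes];   -- 0x0820 (little endian)
```

The byte order in the second column is swapped relative to the code point in the first, because the value is stored little endian.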
NVARCHAR data is stored as UTF-16, which works in 2-byte sets. Looking at the last 4 hex digits to see what the hidden 2-byte set is, we see "0820". Since Windows and SQL Server are UTF-16 Little Endian (i.e. UTF-16LE), the bytes are in reverse order. Flipping the final 2 bytes, 08 and 20, we get "2008", which is the "Punctuation Space" that we added via NCHAR(0x2008).
Also, please note that RTRIM did not help at all here.
Simplistically, you can just replace the question marks with nothing:
SELECT REPLACE(CONVERT(VARCHAR(20), [PostalCode]), '?', '');
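One caveat with replacing '?' after the conversion is that it would also strip any question mark that was genuinely part of the value. If the offending character is already known (here, the Punctuation Space found via VARBINARY), a possible alternative is to remove it while the value is still NVARCHAR and only then convert. This is a sketch, not the answer's method; the binary collation Latin1_General_BIN2 is my assumption, chosen so that the match is by exact code point rather than by linguistic comparison rules.

```sql
DECLARE @PostalCode NVARCHAR(20);
SET @PostalCode = N'053000' + NCHAR(0x2008); -- sample value with a trailing Punctuation Space

-- Strip the known character first (binary collation = exact code point match),
-- then convert the cleaned value to VARCHAR.
SELECT CONVERT(VARCHAR(20),
               REPLACE(@PostalCode COLLATE Latin1_General_BIN2, NCHAR(0x2008), N'')
              ) AS [Cleaned]; -- 053000
```

This only handles the one character named in the REPLACE call; other non-convertible characters would still come through as '?'.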
More importantly, you should convert the [PostalCode] field to VARCHAR so that it doesn't store these characters. No country uses letters that are not represented in the ASCII character set and that are not valid for the VARCHAR datatype, at least as far as I have ever read about (see bottom section for references). In fact, what is allowed is a rather small subset of ASCII, which means you can easily filter on the way in (or just do the same REPLACE as shown above when inserting or updating):
ALTER TABLE [table] ALTER COLUMN [PostalCode] VARCHAR(20) NULL; -- use NOT NULL instead if the column is currently NOT NULL
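The ALTER statement has to state the column's nullability explicitly, so it helps to look up the current setting first. One way to do that is to query the sys.columns catalog view, as sketched below. The names dbo.TableName and PostalCode are placeholders for the real table and column.

```sql
-- Hypothetical names: dbo.TableName and PostalCode stand in for the real table and column.
SELECT c.[name]        AS [ColumnName],
       c.[is_nullable] AS [IsNullable]  -- 1 = NULL allowed, 0 = NOT NULL
FROM   sys.columns c
WHERE  c.[object_id] = OBJECT_ID(N'dbo.TableName')
AND    c.[name] = N'PostalCode';
```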
Be sure to check the current NULL / NOT NULL setting for the column and make it the same in the ALTER statement above, else it could be changed, as the default is NULL if not specified.
If you cannot change the schema of the table and need to do a periodic "cleansing" of the bad data, you can run the following:
;WITH cte AS
(
    SELECT *
    FROM   TableName
    WHERE  [PostalCode] <>
           CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), [PostalCode]))
)
UPDATE cte
SET    cte.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), [PostalCode]), '?', '');
Please keep in mind that the above query is not meant to work efficiently if the table has millions of rows. At that point it would need to be handled in smaller sets via a loop.
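A minimal sketch of such a loop follows. It applies the same filter and the same REPLACE as the query above, but touches a limited number of rows per pass. The batch size of 5000 is an arbitrary assumption to be tuned for the actual table, and TableName is the same placeholder as above.

```sql
DECLARE @BatchSize INT;
SET @BatchSize = 5000; -- assumed value; tune for the real workload

WHILE (1 = 1)
BEGIN
    -- Clean at most @BatchSize rows that do not survive a round trip through VARCHAR.
    UPDATE TOP (@BatchSize) tn
    SET    tn.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), tn.[PostalCode]), '?', '')
    FROM   TableName tn
    WHERE  tn.[PostalCode] <>
           CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), tn.[PostalCode]));

    -- A short batch means no more bad rows remain.
    IF (@@ROWCOUNT < @BatchSize) BREAK;
END;
```

Each cleaned row no longer matches the WHERE clause, so every pass picks up a fresh set of rows and the loop ends once a pass updates fewer rows than the batch size.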
For reference, here is the Wikipedia article for Postal code, which currently states that the only characters ever used are:
- The Arabic numerals "0" to "9"
- Letters of the ISO basic Latin alphabet
- Spaces, hyphens
And regarding the max size of the field, here is the Wikipedia List of postal codes.
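Given that short list of allowed characters, filtering "on the way in" could look something like the check below. This is a sketch under my own assumptions: the table variable @Codes and its sample values are made up for illustration, the multi-row VALUES syntax needs SQL Server 2008 or later, and the binary collation is used so that the ranges 0-9, A-Z and a-z match by code point only. Hyphens are removed before the LIKE test to avoid relying on how a hyphen is interpreted inside a LIKE character set.

```sql
-- Hypothetical sample data for illustration only.
DECLARE @Codes TABLE ([PostalCode] NVARCHAR(20));
INSERT INTO @Codes ([PostalCode])
VALUES (N'053000'), (N'SW1A 1AA'), (N'12345-6789'), (N'053000' + NCHAR(0x2008));

SELECT [PostalCode],
       CASE
           -- After dropping hyphens, any character outside digits, basic Latin
           -- letters, and the plain space marks the value as not allowed.
           WHEN REPLACE([PostalCode] COLLATE Latin1_General_BIN2, N'-', N'')
                LIKE N'%[^0-9A-Za-z ]%'
           THEN 'rejected'
           ELSE 'allowed'
       END AS [Status]
FROM   @Codes;
```

With this data, the first three values should come back as allowed and the last one, which carries the Punctuation Space, as rejected. The same predicate could be placed in a CHECK constraint or applied in the inserting code.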