SQL - UTF-8 to varchar/nvarchar Encoding issue
Problem description
Background: I am receiving response data from a website as UTF-8 encoded JSON. The body attribute of the JSON holds base64Binary values, which I store in an nvarchar column on MS SQL Server.
When I convert that base64Binary data to varchar or nvarchar, I see funny characters in place of the curly double quotes, which indicates an encoding issue and is the reason I am asking this question.
The dissected code follows, with a runnable example and my concerns at the bottom.
Notice the funny characters during conversion. For example, the quotes in the following text come out garbled:
On behalf of IRB Holding Corp (the “Company”)
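The garbling can be reproduced outside SQL Server. The following is a minimal Python sketch, not part of the original question, that decodes the sample body value from the runnable example and reads the same bytes first as UTF-8 and then as Windows-1252. The choice of code page 1252 is an assumption; the code page actually used by varchar depends on the database collation.

```python
import base64

# Sample payload copied from the runnable example in the question.
BODY = (
    "REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kg"
    "SG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=="
)

raw = base64.b64decode(BODY)

# Correct interpretation: the payload bytes are UTF-8.
as_utf8 = raw.decode("utf-8")

# Assumed misinterpretation: a single-byte code page (Windows-1252 here).
# Byte 0x9D has no mapping in Python's cp1252 codec, hence errors="replace".
as_cp1252 = raw.decode("cp1252", errors="replace")

print(as_utf8)
print(as_cp1252)
```

In the second output, each three-byte UTF-8 sequence for a dash or a curly quote shows up as three separate characters, which is the kind of "funny characters" described above.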
The following query fixes the above issue: the quotes appear as they should. However, it fails on rows containing '&', which is a special character in XML.
select convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')))
The second query in the example below handles that by using replace, and I can see all the rows as expected. But this solution only handles the '&' characters.
Example code to run:
declare @t table ( [body] nvarchar(max) )
insert into @t(body)
select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=='
select convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))
, convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + replace(convert(varchar(max), cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')), '&', '&amp;'))
from @t
The question: will I have to add more replace statements for the other XML special characters, such as < and >?
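For reference, XML character data reserves only a few characters: & and < must always be escaped, and > is conventionally escaped too (it is strictly required only in the sequence ]]>). The Python sketch below is an illustration rather than part of the original post. It uses the standard library's escape helper to show the three replacements and confirms that the text survives a round trip through an XML parser. The sample string is made up.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical sample text containing all three reserved characters.
text = "Covey & Park <Energy> Holdings"

# escape() replaces & first, then < and >, so the ampersands
# introduced by the later replacements are not escaped twice.
escaped = escape(text)

# Parsing the escaped text gives back the original string.
roundtrip = ET.fromstring("<root>" + escaped + "</root>").text

print(escaped)
print(roundtrip)
```

The order matters for a hand-written chain of replace calls as well: replacing & after < or > would corrupt the entities that were just inserted.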
The XML trick works fine; just let the XML engine handle the character entities:
declare @t table ([body] nvarchar(max));
insert into @t(body)
values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');
select
    cast(
        cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
        +
        cast('' as xml).value('xs:base64Binary(sql:column("body"))', 'varbinary(max)')
        +
        cast(']]></root>' as varbinary(max))
    as xml).value('.', 'nvarchar(max)')
from
    @t;
The important parts here are:
- The absence of N in front of the string literals
- The encoding="UTF-8"
- The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to varbinary gives valid UTF-8
- The <![CDATA[ ]]> block.
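The same byte-level assembly can be tried outside SQL Server. The Python sketch below is an analogy under stated assumptions, not the SQL Server implementation. It wraps the decoded payload bytes in an ASCII-only XML prolog and a CDATA section, then lets the standard library parser decode the whole document as UTF-8.

```python
import base64
import xml.etree.ElementTree as ET

# Sample payload copied from the example above.
BODY = (
    "REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kg"
    "SG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=="
)

raw = base64.b64decode(BODY)

# The wrapper uses ASCII characters only, so its bytes are valid UTF-8.
prefix = '<?xml version="1.0" encoding="UTF-8"?><root><![CDATA['.encode("ascii")
suffix = "]]></root>".encode("ascii")

# Concatenate at the byte level, as the varbinary casts do in the query.
document = prefix + raw + suffix

# The parser honours the declared encoding, and CDATA keeps '&' literal.
text = ET.fromstring(document).text

print(text)
```

No replace call is needed here because the content of a CDATA section is not scanned for entities or markup, apart from the closing ]]> sequence.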
Note that it is still no more than a hack. As soon as you involve XML, you are subject to the XML limitations, and if your string contains characters not representable in XML, that type of XML conversion is going to fail with
XML parsing: line 1, character 54, illegal xml character
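The limitation is a property of XML 1.0 rather than of SQL Server specifically: most control characters are not allowed anywhere in a document, including inside CDATA. The following Python sketch illustrates the same failure mode with the standard library parser. The helper name cdata_decode and the sample inputs are hypothetical.

```python
import xml.etree.ElementTree as ET


def cdata_decode(raw: bytes) -> str:
    """Decode UTF-8 bytes by wrapping them in a CDATA section (hypothetical helper)."""
    doc = (
        b'<?xml version="1.0" encoding="UTF-8"?><root><![CDATA['
        + raw
        + b"]]></root>"
    )
    return ET.fromstring(doc).text


# Ordinary text, including '&', passes through unchanged.
ok = cdata_decode("Covey & Park".encode("utf-8"))

# U+0001 is not a legal XML 1.0 character, so parsing is rejected.
try:
    cdata_decode(b"Covey \x01 Park")
    failed = False
except ET.ParseError as exc:
    failed = True
    print("rejected:", exc)
```

If such characters can occur in the payload, the XML route is not a safe general-purpose decoder.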