SQL - UTF-8 to varchar/nvarchar Encoding issue


Problem description


Background: I am receiving response data from a website as UTF-8 encoded JSON. The body attribute of the JSON holds base64Binary values, which I store in an nvarchar column on MS SQL Server.

When I convert that base64Binary data to varchar or nvarchar, I see funny characters (in place of double quotes), which indicates an encoding issue. That is the reason I am asking this question.

See dissected code as follows and a runnable example at the bottom with my concerns.

Notice funny characters during conversion.

e.g. On behalf of IRB Holding Corp (the “Company")

The following query fixes the issue above: I see the quotes as they should appear. However, it then fails on rows containing '&', which is a special character in XML.

    select convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + convert(varchar(max), cast('' as xml).value('xs:base64Binary(sql:column("body"))', 'varbinary(max)')))

The following query works around that failure with a replace statement, and I can see all the rows as expected. But this solution only handles the '&' characters.

Example code to run:

    declare @t table ([body] nvarchar(max));

    insert into @t(body)
    select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==';

    select convert(varchar(max), cast('' as xml).value('xs:base64Binary(sql:column("body"))', 'varbinary(max)'))
         , convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + replace(convert(varchar(max), cast('' as xml).value('xs:base64Binary(sql:column("body"))', 'varbinary(max)')), '&', '&amp;'))
    from @t;

The question: will I have to add more replace statements for the other XML special characters (<, >)?

Solution

The XML trick works fine; just let the XML engine handle the character entities:

declare @t table ([body] nvarchar(max));

insert into @t(body) 
values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');

select
    cast(
        cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
        +
        CAST('' as xml).value('xs:base64Binary(sql:column("body"))', 'VARBINARY(MAX)')
        +
        cast(']]></root>' as varbinary(max))
    as xml).value('.', 'nvarchar(max)')
from
@t;
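
As a quick check on the question about < and >, the same query can be run against a value that contains all three special characters. This is only a sketch: the variable @body and the test string are made up for illustration, and sql:variable() stands in for sql:column() so that no table is needed. Because the decoded bytes sit inside a CDATA section, no replace calls should be required.

```sql
-- Hypothetical test value: 'YTxiJmM+ZA==' is the base64 encoding of the ASCII text a<b&c>d
declare @body nvarchar(max) = N'YTxiJmM+ZA==';

select
    cast(
        -- XML declaration and CDATA opener as single-byte (non-N) literals
        cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
        +
        -- raw bytes decoded from base64; no entity escaping is applied here
        cast('' as xml).value('xs:base64Binary(sql:variable("@body"))', 'varbinary(max)')
        +
        cast(']]></root>' as varbinary(max))
    as xml).value('.', 'nvarchar(max)') as decoded;
```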

The important parts here are:

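  • The absence of N in front of the string literals, so that they are cast to varbinary as single-byte characters rather than as UTF-16
  • The encoding="UTF-8" attribute in the XML declaration
  • The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to varbinary gives valid UTF-8
  • The <![CDATA[ ... ]]> block, which lets characters such as & and < pass through without escaping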
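
The point about the N prefix can be seen by comparing the bytes that each kind of literal produces. This is a minimal illustration; the column aliases are made up for the example.

```sql
select
    -- varchar literal: one byte per ASCII character, identical to its UTF-8 encoding
    cast('<?xml' as varbinary(max))  as single_byte_prefix,  -- 0x3C3F786D6C
    -- nvarchar literal: UTF-16LE, two bytes per character, which would not match encoding="UTF-8"
    cast(N'<?xml' as varbinary(max)) as utf16_prefix;        -- 0x3C003F0078006D006C00
```

Concatenating the UTF-16 form with UTF-8 payload bytes would give a byte stream in mixed encodings, which is why the literals in the query are left without the N prefix.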

Note that it is still no more than a hack. As soon as you involve XML, you are subject to the limitations of XML: if your string contains characters that are not representable in XML, this type of XML conversion is going to fail with:

XML parsing: line 1, character 54, illegal xml character
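
A sketch of how that failure could be reproduced and caught is shown below. It assumes a made-up value, @body, whose decoded bytes include the control character 0x01, which XML 1.0 does not allow even inside CDATA. The character position reported in the message depends on the input, so it will not necessarily match the one quoted above.

```sql
-- Hypothetical input: 'QQFC' is the base64 encoding of the bytes 0x41 0x01 0x42 ("A", control char, "B")
declare @body nvarchar(max) = N'QQFC';

begin try
    select
        cast(
            cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
            + cast('' as xml).value('xs:base64Binary(sql:variable("@body"))', 'varbinary(max)')
            + cast(']]></root>' as varbinary(max))
        as xml).value('.', 'nvarchar(max)') as decoded;
end try
begin catch
    -- Report the conversion error instead of aborting the batch
    select error_number() as error_number, error_message() as error_message;
end catch;
```

Wrapping the conversion in try/catch does not make such rows decodable; it only lets the caller detect them and decide how to handle them.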
