Encoding binary data within XML: Are there better alternatives than base64?


Question

I want to encode and decode binary data within an XML file (with Python, but the language does not matter). The problem is that XML element content cannot contain arbitrary characters. The only allowed ones are described in the XML specification:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the following are not allowed:

  • 29 Unicode control characters are illegal (0x00 - 0x1F), i.e. (000xxxxx), all except 0x09, 0x0A, 0x0D
  • The UTF-16 surrogate code points are illegal (U+D800 - U+DFFF), i.e. (11011xxx)
  • The special Unicode noncharacters are illegal (0xFFFE - 0xFFFF), i.e. (11111111 1111111x)
  • <, >, & must be escaped in entity content, according to this post

One byte can encode 256 possible values. With these restrictions, the first byte is limited to 256-29-8-1-3 = 215 possibilities.
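The Char production and the count of 29 excluded control characters can be checked with a few lines of Python. This is only a sketch: the function name `is_xml10_char` is made up for illustration, and it tests code points, not encoded bytes. Note that <, > and & are legal Chars; they only need escaping.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Control characters below 0x20 that the production rejects.
illegal_controls = [cp for cp in range(0x20) if not is_xml10_char(cp)]

# The question's own arithmetic for the first byte:
# 29 controls, 8 surrogate lead bytes, 1 for 0xFF, 3 for <, > and &.
first_byte_budget = 256 - len(illegal_controls) - 8 - 1 - 3

print(len(illegal_controls), first_byte_budget)
```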

Of that first byte's 215 possibilities, base64 uses only 64. Base64 generates 33% overhead (each 6 bits of input become 1 byte once encoded with base64).
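To put numbers on this, the sketch below measures the base64 overhead on random data and compares it with the theoretical overhead of an alphabet of a given size. The helper names are hypothetical, and `ideal_overhead` assumes every symbol costs exactly one byte on the wire, which is the premise of the question and not something UTF-8 guarantees for values above 0x7F.

```python
import base64
import math
import os

def measured_overhead(payload: bytes) -> float:
    """Relative size increase after base64 encoding."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload) - 1

def ideal_overhead(alphabet_size: int) -> float:
    """Overhead if each output byte carried log2(alphabet_size) bits."""
    return 8 / math.log2(alphabet_size) - 1

payload = os.urandom(3000)  # a multiple of 3, so base64 adds no padding
print(f"base64 measured: {measured_overhead(payload):.1%}")
print(f"64 symbols:      {ideal_overhead(64):.1%}")
print(f"215 symbols:     {ideal_overhead(215):.1%}")
```

Under that one-byte-per-symbol assumption, a 215-symbol alphabet would bring the overhead down to a few percent, which is the gap the question is asking about.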

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)
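For reference, the base64 baseline that any alternative has to beat might look like the following with Python's standard library. This is a minimal sketch; the element name `blob`, the `encoding` attribute and the function names are arbitrary choices for illustration.

```python
import base64
import xml.etree.ElementTree as ET

def to_xml(payload: bytes) -> bytes:
    """Wrap payload in a single element as base64 text."""
    root = ET.Element("blob", {"encoding": "base64"})
    root.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root)

def from_xml(document: bytes) -> bytes:
    """Recover the original bytes from a document made by to_xml."""
    root = ET.fromstring(document)
    # An empty payload comes back as text=None, so fall back to "".
    return base64.b64decode(root.text or "")

payload = bytes(range(256))  # every byte value, including illegal XML chars
document = to_xml(payload)
restored = from_xml(document)
print(restored == payload)
```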

NB: Please do not answer this post with "You shouldn't use XML to encode binary data because...". Just don't. At best, you could argue against using all 215 possibilities because of poor support in bad XML parsers.

NB2: I'm not speaking about the second byte, but there are certainly some considerations that we can develop regarding the number of possibilities and the fact that it should start with 10xxxxxx to respect the UTF-8 standard when we use the supplementary Unicode planes (what if not?).

Solution

I have developed the concept in C code.

The project is on GitHub and is finally called BaseXML: https://github.com/kriswebdev/BaseXML

It has a 20% overhead, which is good for a binary safe version.

I had a hard time making it work with Expat, which is the XML parser behind the scenes in Python (and THAT DOESN'T SUPPORT XML 1.1!). So you'll find the BaseXML1.0 binary-safe version for XML 1.0.
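The XML 1.0 restriction is easy to observe from Python. The sketch below is not part of BaseXML; it only shows that the Expat-backed `xml.etree.ElementTree` parser rejects a control character, whether it appears raw or as a character reference. The helper name `parses` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(document: bytes) -> bool:
    """Return True if ElementTree (Expat) accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses(b"<a>\t</a>"))     # tab is an allowed Char
print(parses(b"<a>\x01</a>"))   # raw control character
print(parses(b"<a>&#x1;</a>"))  # same character as a reference
```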

I may release the "for XML 1.1" version later if requested (it is also binary safe and has a 14.7% compression ratio). It is ready and working, but useless with Python's built-in XML parsers, so I don't want to confuse people with too many versions (yet).
