Handling surrogate pairs while parsing XML using libxml2

Question

I am trying to parse XML using libxml2. However, the input sometimes contains character references to surrogate code points, which lie outside the range allowed by http://www.w3.org/TR/REC-xml/#NT-Char
Because of this, libxml2 cannot parse the document and reports an error. Can somebody tell me how to handle surrogate pairs while parsing XML with libxml2?

An example of the XML I want to parse:

<?xml version="1.0" encoding="UTF-8"?>
<message><body>  &#xD83D;&#xD83D;</body></message>

Recommended answer

Note that xD83D (U+D83D) is a high surrogate. A surrogate pair consists of a high surrogate followed by a low surrogate. Two high surrogates next to each other are not a "surrogate pair"; they are nonsense.
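
To make the distinction concrete, here is a minimal Python sketch that classifies a code point using the surrogate ranges defined by Unicode (high: U+D800 to U+DBFF, low: U+DC00 to U+DFFF). The function name `surrogate_kind` is made up for this illustration and is not part of libxml2.

```python
def surrogate_kind(code_point: int) -> str:
    """Classify a code point as 'high', 'low', or 'none' (not a surrogate)."""
    if 0xD800 <= code_point <= 0xDBFF:
        return "high"
    if 0xDC00 <= code_point <= 0xDFFF:
        return "low"
    return "none"


# Both references in the example document are high surrogates,
# so together they do not form a valid pair.
print(surrogate_kind(0xD83D), surrogate_kind(0xD83D))
```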

Also note that the correct way to represent a non-BMP character in XML is a single character reference to the character's code point, for example &#x120AB;. Some character encodings (UTF-16, for instance) require a non-BMP character to be split into two surrogates, but XML character references neither need nor allow this. A character reference in XML denotes a Unicode code point, not a numeric value specific to a particular character encoding.
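
The arithmetic that turns a high/low surrogate pair into the single code point it stands for can be sketched as follows. This is the standard UTF-16 decoding formula; the helper name `combine_surrogates` is hypothetical and chosen only for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+120AB is encoded in UTF-16 as the pair D808 DCAB.
code_point = combine_surrogates(0xD808, 0xDCAB)
print(f"&#x{code_point:X};")  # prints "&#x120AB;"
```

Passing two high surrogates, as in the example document, raises `ValueError`, because no code point corresponds to such a sequence.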

If you can't fix the program that produced this bad XML, the best approach is to repair the XML with a script (in Perl, for example) that finds the invalid pairs of character references and replaces each pair with the correct XML representation.
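
A minimal sketch of such a repair script is shown here in Python rather than Perl. It assumes the broken references are written in hexadecimal (`&#x...;`); decimal references are not handled. The function name `repair_surrogate_refs` is made up for this example, and the script is meant to run on the raw text before it is handed to libxml2.

```python
import re

# A hexadecimal reference to a high surrogate (D800-DBFF) immediately
# followed by a hexadecimal reference to a low surrogate (DC00-DFFF).
_SURROGATE_PAIR = re.compile(
    r"&#x0*([dD][89abAB][0-9a-fA-F]{2});"
    r"&#x0*([dD][c-fC-F][0-9a-fA-F]{2});"
)


def repair_surrogate_refs(xml_text: str) -> str:
    """Replace each high/low surrogate reference pair with one reference."""

    def replace(match: re.Match) -> str:
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return f"&#x{code_point:X};"

    return _SURROGATE_PAIR.sub(replace, xml_text)


print(repair_surrogate_refs("<body>&#xD83D;&#xDE00;</body>"))
```

Note that this only repairs well-formed pairs. The example in the question contains two high surrogates and no low surrogate, so the text passes through unchanged and the original character cannot be recovered; such input needs a separate decision, for instance removing the stray references.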
