How do I avoid double UTF-8 encoding in XML::LibXML


Question

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.

Here's a piece of code trying to visualize the problem:

use strict; use diagnostics;    use feature 'unicode_strings';
use utf8;   use v5.14;      use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)");    use open qw( :encoding(UTF-8) :std );
use XML::LibXML;

# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";

# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" );    $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );

# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );

Spoiler 1: the good case:

<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>

Spoiler 2: the bad case:

<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>

The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.

What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?

(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)

Recommended answer

ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:

IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!

(serialize is just an alias for toString.)

When you print a byte string to a filehandle marked with an :encoding layer, each byte is treated as an ISO-8859-1 character and then encoded. Since your string already contains UTF-8 bytes, it gets double encoded.
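
The effect can be reproduced without XML::LibXML at all. The following is a minimal sketch that uses only the core Encode module; the variable names are made up for illustration, and the second encode() call stands in for what an :encoding(UTF-8) layer does when it is handed bytes instead of characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: one Perl character per code point ("Üß").
my $chars = "\x{DC}\x{DF}";

# What a document-level toString() hands back: UTF-8 *bytes*.
my $bytes = encode('UTF-8', $chars);     # C3 9C C3 9F

# Stand-in for the :encoding(UTF-8) layer: it treats every element of the
# string as a character and encodes it. Fed with bytes, each byte in the
# range 0x80-0xFF turns into a two-byte sequence.
my $double = encode('UTF-8', $bytes);    # C3 83 C2 9C C3 83 C2 9F

# Hex dumps make the difference visible.
printf "bytes:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "double: %s\n", join ' ', map { sprintf '%02X', ord } split //, $double;
```

Four bytes become eight, which matches the pattern described in the question's hex dump.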

As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit an XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
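
Both options can be sketched side by side. This is an illustration rather than a drop-in fix: the in-memory filehandles stand in for STDOUT or a real file so the resulting bytes can be compared, the variable names are hypothetical, and the input is built with decode() to mimic text that was read through a decoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);
use XML::LibXML;

# Character data as it would arrive through an :encoding(UTF-8) input layer.
my $text = decode('UTF-8', "\xC3\x9C\xC3\x9F\xC4\xB1");   # "Üßı"

# Same document shape as in the question.
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('response');
$doc->setDocumentElement($root);
$root->appendTextChild('result', $text);

# On a document node, toString() returns bytes in the document's encoding.
my $xml_bytes = $doc->toString();

# Option 1: a raw handle with no encoding layer; print the bytes untouched.
my $raw_out = '';
open(my $raw_fh, '>:raw', \$raw_out) or die "open failed: $!";
print {$raw_fh} $xml_bytes;
close($raw_fh);

# Option 2: keep the encoding layer, but decode the bytes back into
# characters first. Only correct while the layer's encoding matches the
# encoding declared in the XML header.
my $layered_out = '';
open(my $enc_fh, '>:encoding(UTF-8)', \$layered_out) or die "open failed: $!";
print {$enc_fh} decode('UTF-8', $xml_bytes);
close($enc_fh);

print "identical output\n" if $raw_out eq $layered_out;
```

For STDOUT, option 1 corresponds to calling binmode(STDOUT) before printing the serialized document.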
