PHP UTF-8 decode from XML returns question marks


Question

I have some problems using XML. I know this is a common question, but the answers I found didn't fix my problem. When I add é, ä or another special character to my XML file with PHP's DOMDocument, it saves the é as xE9 and the ä as xE4. I don't know if this is OK, but when I want to show the output, it shows question marks in these places.

I have tried a lot, like removing and adding the encoding in the XML header in the PHP DOMDocument. I also tried using file_get_contents and PHP's utf8_decode to get the XML. I tried using ISO instead, but nothing solved my problem. Instead, I sometimes got PHP XML parse errors. I must be doing something wrong, but what? That's my question, and how can I solve this problem?

My XML file looks like this (the xE9 and the xE4 have black backgrounds):

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row id="1">
    <question>blah</question>
    <answer>blah</answer>
  </row>
  <row id="2">
    <question>xE9</question>
    <answer>xE4</answer>
  </row>
</root>

And a part of my PHP XML class:

function __construct($filePath) {
    $this->file = $filePath;
    $this->label = array('Vraag', 'Antwoord');
    $xmlStr = file_get_contents($filePath);
    $xmlStr = utf8_decode($xmlStr);
    $this->xmlDoc = new DOMDocument('1.0', 'UTF-8');
    $this->xmlDoc->preserveWhiteSpace = false;
    $this->xmlDoc->formatOutput = true;
    //$this->xmlDoc->load($filePath);   
    $this->xmlDoc->loadXML($xmlStr);
}       

This is the function that adds a new row:

//creates new xml row and saves it in xml file
function addNewRow($question, $answer) {
    $nextAttr = $this->getNextRowId();
    $parentNode = $this->xmlDoc->documentElement;
    $rowNode = $this->xmlDoc->createElement('row');
    $rowNode = $parentNode->appendChild($rowNode);
    $rowNode->setAttribute('id', $nextAttr);    
    $q = $this->xmlDoc->createElement('question');
    $q = $rowNode->appendChild($q);
    $qText = $this->xmlDoc->createTextNode($question);
    $qText = $q->appendChild($qText);
    $a = $this->xmlDoc->createElement('answer');
    $a = $rowNode->appendChild($a);
    $aText = $this->xmlDoc->createTextNode($answer);
    $aText = $a->appendChild($aText);
    $this->xmlDoc->save($this->file);
}

Everything works fine until I add special characters. Those are shown as question marks.

Recommended answer

Okay, the following is a bit rough and verbose, especially as you have already tried so much. Try to keep fresh eyes, and consider that once you make even a small mistake with encoding, things are often already broken. It is therefore important to properly understand which mechanics are at work here.

I'll try to address some of the mechanics that operate in PHP's DOMDocument. You might find this interesting or daunting, and in the end the solution may be so simple that you don't even need to change your PHP code. I'd like to address it anyway, because it is not well documented on Stack Overflow or in the PHP manual, and it is good to have more reference material on something this important to understand.

So, by default XML is in UTF-8. UTF-8 is pretty much the perfect choice for the internet nowadays. Sure, this is not true in all cases, but generally it is a safe bet. So XML on its own, with its default encoding UTF-8, is perfectly fine.

What does this mean for DOMDocument? Just that by default DOMDocument uses this encoding and we do not need to care about it. Here is a simple demonstration; the output follows as a comment:

$doc = new DOMDocument();
$doc->save('php://output');
# <?xml version="1.0"?>

This very short example shows the default UTF-8 encoding PHP uses for DOMDocument. Even though this document does not contain a root node yet, it already shows the default XML encoding UTF-8 by not specifying one in the XML declaration: <?xml version="1.0"?>.

So you might say "but I want to specify it", and sure you can. This is what the encoding parameter of the DOMDocument constructor is for:

$doc = new DOMDocument('1.0', 'UTF-8');
                               #####  Encoding Parameter
$doc->save('php://output');
# <?xml version="1.0" encoding="UTF-8"?>

As this shows, whatever we pass as the first (version) and second (encoding) parameter is written out. So yes, we can do things that are not allowed. But what is allowed in the XML declaration? In practice the XML version is 1.0 (XML 1.1 exists but is rarely used), so the version parameter should always be 1.0. And what is allowed for the encoding? The XML spec says all IANA character sets; in short, it should (not must) be one of these common ones: UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, EUC-JP. Okay, wow, this is already a long list.

So let's take a look at what PHP's DOMDocument allows us to do in practice:

$doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'UTF-8');
$doc->save('php://output');
# <?xml version="♥♥ love, hugs and kisses ♥♥" encoding="UTF-8"?>

The encoding works as expected. The version is cosmetic, but it shows that this is using Unicode characters encoded as UTF-8. Now let's change the encoding to something different:

$doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'ISO-8859-1');
$doc->save('php://output');
# <?xml version="&#9829;&#9829; love, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>

Because the Unicode hearts do not have a place in ISO-8859-1, they are replaced with their corresponding numeric character reference (&#9829;). And what happens if we put an ISO-8859-1 character like ö (the binary string "\xF6" in PHP) directly in there?

$doc = new DOMDocument("♥♥ l\xF6ve, hugs and kisses ♥♥", 'ISO-8859-1');
$doc->save('php://output');
# Warning: DOMDocument::save(): output conversion failed due to conv error, 
#          bytes 0xF6 0x76 0x65 0x2C
#                ^^^^  |    |    |
#                "ö"   v    e   space

This does not work. DOMDocument tells us that the data we provided cannot be turned into ISO-8859-1 output. This is expected: DOMDocument expects all input it is given to be UTF-8. So let's take ö from Unicode this time:

$doc = new DOMDocument('♥♥ löve, hugs and kisses ♥♥', 'ISO-8859-1');
$doc->save('php://output');
# <?xml version="&#9829;&#9829; l�ve, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>

This now looks fine, despite the question mark in a diamond. Because the display/output on my computer is in UTF-8, it cannot display the ISO-8859-1 ö byte here, so my display replaces it with � (the Unicode 'REPLACEMENT CHARACTER', U+FFFD). That is correct: the "ö" now works.

So far this makes clear that you can only pass UTF-8 encoded strings into DOMDocument, regardless of the XML encoding you have specified for that document.
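
One way to guard this rule in code is to check strings before they reach DOMDocument. The following is only a minimal sketch, assuming the mbstring extension is available; the helper name assertUtf8 is hypothetical and not part of DOMDocument:

```php
<?php
// Hypothetical helper (not part of PHP or DOMDocument): reject strings
// that are not valid UTF-8 before they are handed to DOMDocument.
function assertUtf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $text;
}

$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('root'));

// Valid UTF-8 ("\u{F6}" is ö encoded as UTF-8) passes through unchanged.
$root->appendChild($doc->createTextNode(assertUtf8("l\u{F6}ve")));

// A lone ISO-8859-1 byte is rejected instead of ending up in the file.
try {
    $root->appendChild($doc->createTextNode(assertUtf8("l\xF6ve")));
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Failing early like this makes the broken input visible at the point where it enters the document, rather than later when the saved file is displayed.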

So let's break this rule with a UTF-8 document, as in your question, and add some non-UTF-8 text, for example in ISO-8859-1 or Windows-1252:

$doc = new DOMDocument('1.0', 'UTF-8');

$doc->appendChild($doc->createElement('root'))
    ->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode("l\xF6ve, hugs and kisses"));

$doc->save('php://output');
# <?xml version="1.0" encoding="UTF-8"?>
# <root><question>l�ve, hugs and kisses</question></root>

Depending on which program you use to view the output, it might not show the question mark � but just "xF6". I would say that is the case with your file editor.

So this is also the solution: when you pass string data into DOMDocument, ensure it is UTF-8 encoded:

->appendChild($doc->createTextNode(utf8_encode("l\xF6ve, hugs and kisses")));
                                   ########### (works with ISO-8859-1 only (!))

# <?xml version="1.0" encoding="UTF-8"?>
# <root><question>löve, hugs and kisses</question></root>
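
Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A minimal sketch of the same step using mb_convert_encoding() instead, assuming the mbstring extension is loaded and that the incoming bytes really are ISO-8859-1 (pass 'Windows-1252' as the source encoding if that is what you actually receive):

```php
<?php
// Assumption: $latin1 holds ISO-8859-1 bytes, e.g. from a legacy form or file.
$latin1 = "l\xF6ve, hugs and kisses";

// Convert explicitly from the known source encoding to UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild($doc->createElement('root'))
    ->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode($utf8));

echo $doc->saveXML();
```

This should produce the same document as the utf8_encode() version. For the constructor in the question, the same reasoning suggests dropping the utf8_decode() call: the file is already UTF-8, so decoding it to ISO-8859-1 before loadXML() hands the parser bytes that no longer match the declared encoding, which would explain the occasional parse errors.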

Or, in your case, tell the browser that your website expects UTF-8. Then you don't need to re-encode anything, because your browser already sends the data in the right encoding. The W3C has collected some useful resources on the topic, which I suggest you read now:

  • Multilingual form encoding
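
As a rough illustration of what "telling the browser" can look like, here is a minimal sketch. It assumes a plain PHP page that renders the form; the handler name addrow.php and the field names are hypothetical and only mirror the question/answer rows from the XML above:

```php
<?php
// Declare UTF-8 in the HTTP header. Browsers normally submit form data
// in the encoding of the page that contains the form.
header('Content-Type: text/html; charset=UTF-8');

// Hypothetical form markup: the meta tag repeats the declaration inside
// the document, and accept-charset states the expectation on the form.
$form = <<<'HTML'
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"></head>
<body>
<form method="post" action="addrow.php" accept-charset="UTF-8">
  <input type="text" name="question">
  <input type="text" name="answer">
  <input type="submit" value="Add">
</form>
</body>
</html>
HTML;

echo $form;
```

With the page declared this way, the submitted values should already arrive as UTF-8 and can be passed to createTextNode() without any re-encoding.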
