Fix malformed XML in PHP before processing using DOMDocument functions


Problem description

I need to load an XML document into PHP that comes from an external source. The XML does not declare its encoding and contains illegal characters such as a bare &. If I try to load the XML document directly in the browser, I get errors like "An invalid character was found in text content". When loading the file in PHP, I also get lots of warnings such as: xmlParseEntityRef: no name in Entity and Input is not proper UTF-8, indicate encoding ! Bytes: 0x9C 0x31 0x21 0x3C.

It's clear that the XML is not well formed and contains illegal characters that should be converted to XML entities.

This is because the XML feed is made up of data supplied by lots of other users and clearly it's not being validated or reformatted before I get it.

I've spoken to the supplier of the XML feed and they say they are trying to get the content providers to sort it out, but this seems silly as they should be validating the input first.

I basically need to fix the XML, correcting any encoding errors and converting any illegal characters to XML entities, so that the XML loads without problems when using PHP's DOMDocument functions.

My code currently looks like:

  $feedURL = '3704017_14022010_050004.xml';
  $dom = new DOMDocument();
  $dom->load($feedURL);

Example XML file showing encoding issue (click to download): feed.xml

Example XML that contains chars that have not been converted to XML entities:

<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>

Solution

Try using the Tidy library, which can be used to clean up bad HTML and XML: http://php.net/manual/en/book.tidy.php
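
As a rough illustration of the Tidy suggestion, the sketch below runs a string through `tidy_repair_string` in XML mode before loading it. The helper name `tidyRepairXml` and the sample input are hypothetical, the tidy extension must be installed, and how much Tidy can actually repair depends on the input, so check the result rather than assume it.

```php
<?php
// Hypothetical helper (not part of the original answer): ask the tidy
// extension to repair an XML string. Returns null if tidy is unavailable
// or the repair fails.
function tidyRepairXml(string $xml): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension is not installed
    }
    $config = [
        'input-xml'  => true, // treat the input as XML rather than HTML
        'output-xml' => true, // write the result as XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];
    $fixed = tidy_repair_string($xml, $config, 'utf8');
    return is_string($fixed) ? $fixed : null;
}

libxml_use_internal_errors(true); // collect parser errors instead of warnings
$fixed = tidyRepairXml('<feed><NAME>This & This</NAME></feed>');
if ($fixed !== null) {
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($fixed); // false if the XML is still malformed
}
libxml_clear_errors();
```

If `$loaded` is false, Tidy alone was not enough for that input and a pre-processing step is still needed.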

A pure PHP solution to fix some XML like this:

<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test < texter</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>

Would be something like this:

  function cleanupXML($xml) {
    $xmlOut = '';
    $inTag = false;
    $xmlLen = strlen($xml);
    for ($i = 0; $i < $xmlLen; ++$i) {
        $char = $xml[$i];
        switch ($char) {
        case '<':
          if (!$inTag) {
              // Seek forward for the next tag boundary
              for ($j = $i + 1; $j < $xmlLen; ++$j) {
                 $nextChar = $xml[$j];
                 switch ($nextChar) {
                 case '<':  // Another < comes first, so this < is text
                   $char = '&lt;';
                   break 2;
                 case '>':  // A > comes first, so we are in a tag
                   $inTag = true;
                   break 2;
                 }
              }
          } else {
             $char = '&lt;';
          }
          break;
        case '>':
          if (!$inTag) {  // No need to seek ahead here
             $char = '&gt;';
          } else {
             $inTag = false;
          }
          break;
        case '&':
          if (!$inTag) {  // Escape every & found in text
             $char = '&amp;';
          }
          break;
        }
        // All other bytes pass through unchanged, so multibyte
        // UTF-8 sequences stay intact.
        $xmlOut .= $char;
    }
    return $xmlOut;
  }

This is a simple state machine that tracks whether we are inside a tag and, if not, escapes &, < and > in the text as XML entities. The characters are escaped explicitly rather than with htmlentities, which, applied one byte at a time, mangles the bytes of multibyte UTF-8 characters.
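
One limitation of escaping every & in text is that input which already contains entities (for example `&amp;`) gets escaped a second time. The sketch below is a different, regex-based technique, not the state machine above: it escapes only ampersands that do not already begin an entity or character reference, then loads the result while collecting parser errors. The helper name `escapeBareAmpersands` and the sample feed are hypothetical. It does not handle stray < or >, and named entities that XML does not define (such as `&nbsp;`) are left alone and will still be reported by the parser.

```php
<?php
// Hypothetical helper: escape "&" only when it is not already the start of
// an entity (&name;) or a character reference (&#38; or &#x26;).
function escapeBareAmpersands(string $xml): string
{
    $pattern = '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';
    // preg_replace returns null only on a regex engine error; fall back to
    // the untouched input in that case.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

// Collect libxml errors instead of having them emitted as PHP warnings.
libxml_use_internal_errors(true);

$raw = '<feed><NAME>This & This &amp; that</NAME></feed>';
$dom = new DOMDocument();
$ok  = $dom->loadXML(escapeBareAmpersands($raw));

// Print whatever the parser still complains about, then reset the list.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();
```

Using `loadXML` on a pre-processed string, rather than `load` on the file name, is what makes it possible to repair the text before the parser sees it.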

It's worth noting that this will be memory hungry on large files so you may want to rewrite it as a stream plugin or a pre-processor.
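
The question also reports "Input is not proper UTF-8" warnings, which escaping alone does not address. A minimal sketch, assuming the invalid bytes come from Windows-1252 text (the 0x9C byte in the warning is consistent with that, but it is only a guess): convert the raw feed to UTF-8 before any other clean-up. The helper name `toUtf8` is hypothetical, and the mbstring extension is required.

```php
<?php
// Hypothetical helper: return the input as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function toUtf8(string $raw, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// 0xE9 is "é" in Windows-1252; after conversion it is the two-byte UTF-8 form.
$converted = toUtf8("caf\xE9");
```

This converts the whole string at once, so a feed that mixes valid UTF-8 with Windows-1252 bytes would have its valid multibyte characters converted a second time; such a feed needs a record-by-record approach instead.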
