Parse invalid XML manually

Question

I have an XML file that is not valid. The file itself has many problems, and I need to do daily reimports from it. The structure looks like this:

<products>
    <product no="AP1222-00" name="Colours kravata" price="456" currency="Kč">
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s černým poutkem.</description>
    </product>
    <product no="AP1222-22" name="Colours kravata" price="330" currency="Kč">
        <description name="POPIS PRODUKTU">Blabla.</description>
    </product>
</products>

Is there an easy way to get the array of products, so that I can fix the problems in the file before importing it? SimpleXML and similar parsers don't work because the file is invalid.

Here is one complete product from the XML for reference. Note the double quotes inside the product name:

<products>
    <product no="AP1222-00" name="" Colours" kravata" price="456" currency="Kč">
        <folders>
            <folder category="<b>COOL 2017</b>" subcategory="TEXTILE & FASHION"/>
            <folder category="TEXTILE & FASHION" subcategory="Kravaty a šály"/>
        </folders>
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s
            černým poutkem.
        </description>
        <properties>
            <property name="KS / KARTON" value="100"/>
            <property name="HMOTNOST KARTONU" value="6"/>
            <property name="NETTO HMOTNOST / KARTON" value="5"/>
            <property name="DIM1" value="15"/>
            <property name="DIM2" value="80"/>
            <property name="DIM3" value="35"/>
            <property name="TECHNOLIGIE POTISKU" value="T1 (8C, 50×80 MM)"/>
            <property name="TARIF" value="6215200090"/>
            <property name="Min. mn. (ks)" value=""/>
            <property name="M3/CARTON" value="0.042"/>
            <property name="COOL 2017 KAPITOLA" value="TEXTILE AND FASHION"/>
            <property name="COOL 2017 STRANY" value="525"/>
            <property name="main category" value="fashion"/>
        </properties>
        <images>
            <image src="http://www.andapresent.com/kepek/cms/original/83653.jpg"/>
        </images>
        <stocks>
            <stock name="navi_central" value="2"/>
            <stock name="navi_arrive" value="" date=""/>
            <stock name="eu_central" value="" date=""/>
            <stock name="eu_arrive_1" value="" date=""/>
            <stock name="eu_arive_2" value="" date=""/>
        </stocks>
    </product>
</products>

Recommended answer

The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes them.

That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), this time correcting the errors with custom rules. These rules are not universal fixes; they are adapted to this specific situation.

When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances, which you retrieve with libxml_get_errors(). Each of them contains an error code, the error line, and the error column. (Note that lines and columns are numbered starting from 1.)
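As a minimal sketch of that error list in isolation, the snippet below parses a small, deliberately broken sample and prints what libxml reports. The three-line sample string and the variable name $loaded are assumptions made for illustration; they are not part of the original file.

```php
<?php
// Collect libxml parse errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical sample: the bare "&" on line 2 is not well-formed XML.
$xml = "<products>\n<folder category=\"TEXTILE & FASHION\"/>\n</products>";

$dom = new DOMDocument;
$loaded = $dom->loadXML($xml);   // false when the document is not well-formed

$errors = libxml_get_errors();   // array of LibXMLError objects
foreach ($errors as $error) {
    // code, line and column are the fields the correction rules rely on.
    printf("code %d at line %d, column %d: %s\n",
        $error->code, $error->line, $error->column, trim($error->message));
}

// Empty the internal buffer so a later parse starts with a clean list.
libxml_clear_errors();
```

The exact column values depend on the libxml version in use, so it is worth printing them for your own file before writing patterns that rely on them.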

$xml = file_get_contents('file.xml');

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$errors = libxml_get_errors();

if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];

    $previousLineNo = 0;
    $lines = explode("\n", $xml);

    foreach ($errors as $error) {

        if (!isset($rules[$error->code])) continue;

        $currentLineNo = $error->line;

        if ( $currentLineNo != $previousLineNo )
            $offset = -1;

        $currentLine = &$lines[$currentLineNo - 1];
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
                                    $rules[$error->code]['replacement']['string'],
                                    $currentLine, -1, $count);
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }

    $xml = implode("\n", $lines);

    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}

var_dump($errors);

$s = simplexml_import_dom($dom);

echo $s->product[0]["name"];

The size in the rules array is the difference between the length of the replacement string and the length of the replaced string (for example, '&amp;' replacing '&' gives 5 - 1 = 4). This way, when there are several errors on the same line, the position of the next error is adjusted with $offset.
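The arithmetic behind these sizes, and the reason the offset has to be tracked, can be sketched as follows. The helper escapeAmpersandAfter(), the sample line, and the skip values are hypothetical and chosen by hand for illustration; they are not the columns libxml would actually report.

```php
<?php
// Each 'size' is strlen(replacement) - strlen(replaced text).
$sizeLt   = strlen('&lt;') - strlen('<');           // 4 - 1 = 3
$sizeAmp  = strlen('&amp;') - strlen('&');          // 5 - 1 = 4
$sizeQuot = 2 * (strlen('&quot;') - strlen('"'));   // two quotes: 2 * 5 = 10

// Hypothetical helper: escape the first "&" found after skipping $skip
// characters, using the same pattern shape as the XML_ERR_NAME_REQUIRED rule.
function escapeAmpersandAfter(string $line, int $skip): string
{
    $pattern = sprintf('~^.{%d}[^&]*\K&~', $skip);
    return preg_replace($pattern, '&amp;', $line);
}

$line = 'a="x & y" b="p & q"';

// First fix: nothing to skip, so the first "&" is escaped.
$once = escapeAmpersandAfter($line, 0);

// In the original line, 6 characters precede the text after the first "&".
// After the first fix that text has moved right by $sizeAmp characters.
$twice = escapeAmpersandAfter($once, 6 + $sizeAmp);

// Without the adjustment, the "&" of the fresh "&amp;" is escaped again.
$wrong = escapeAmpersandAfter($once, 0);

echo $once, "\n", $twice, "\n", $wrong, "\n";
```

Here $twice ends up with both ampersands escaped exactly once, while $wrong contains "&amp;amp;", which is the kind of double escaping the $offset bookkeeping is meant to prevent.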

The libxml error constants are not available in PHP, which is why they are defined manually here (only to make the code more readable). You can find them in the libxml2 documentation, in the xmlParserErrors enumeration.
