How to get count of distinct XML nodes?


Problem description

I'm having trouble using references in recursive calls.

What I am trying to accomplish is to describe an XML document in terms of the max count of distinct nodes within a respective element - without knowing any of the node element names in advance.

Consider this document:

<Data>
    <Record>
        <SAMPLE>
            <TITLE>Superior Title</TITLE>
            <SUBTITLE>Sub Title</SUBTITLE>
            <AUTH>
                <FNAME>John</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <AUTH>
                <FNAME>Jane</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
    </Record>
    <Record>
        <SAMPLE>
            <TITLE>Interesting Title</TITLE>
            <AUTH>
                <FNAME>John</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
        <SAMPLE>
            <TITLE>Another Title</TITLE>
            <AUTH>
                <FNAME>Jane</FNAME>
                <DISPLAY>No</DISPLAY>
            </AUTH>
            <ABSTRACT/>
        </SAMPLE>
    </Record>
</Data>

You can see that a Record has either 1 or 2 SAMPLE nodes and that the SAMPLE has 1 or 2 AUTH nodes. I'm trying to produce an array that will describe the structure of the document in terms of the max number of distinct nodes inside a respective node.

So I'm trying to get a result like this:

$result = [

  "Data" => [
    "max_count" => 1,
    "elements" => [

      "Record" => [
        "max_count" => 2,
        "elements" => [

          "SAMPLE" => [
            "max_count" => 2,
            "elements" => [

              "TITLE" => [
                "max_count" => 1
              ],
              "SUBTITLE" => [
                "max_count" => 1
              ],
              "AUTH" => [
                "max_count" => 2,
                "elements" => [

                  "FNAME" => [
                    "max_count" => 1
                  ],
                  "DISPLAY" => [
                    "max_count" => 1
                  ]

                ]
              ],
              "ABSTRACT" => [
                "max_count" => 1
              ]

            ]
          ]

        ]
      ]

    ]
  ]

];

To preserve a little bit of my sanity, I'm using sabre/xml to do the legwork parsing the XML.

I can get an absolute count of elements using recursive calls with a reference to the original array.

  private function countArrayElements(&$array, &$result){
    // get collection of subnodes
    foreach ($array as $node){

      $name = $this->stripNamespace($node['name']);

      // get count of distinct subnodes
      if (empty($result[$name])){
        $result[$name]["max_count"] = 1;
      } else {
        $result[$name]["max_count"]++;
      }

      if (is_array($node['value'])){
        $this->countArrayElements($node['value'], $result[$name]["elements"]);
      }

    }
  }

So my reasoning was that I could also pass the array by reference and do a comparison, which works for the top two nodes, but somehow resets on the subsequent nodes which results in a count of only 1 for the AUTH node.

  private function countArrayElements(&$array, &$previous){

    // get collection of subnodes
    foreach ($array as $node){

      $name = $this->stripNamespace($node['name']);

      // get count of distinct subnodes
      if (empty($result[$name]["max_count"])){
        $result[$name]["max_count"] = 1;
      } else {
        $result[$name]["max_count"]++;
      }

      // recurse
      if (is_array($node['value'])){
        $result[$name]["elements"] = $this->countArrayElements(
          $node['value'],
          $result[$name]["elements"]
        );
      }

      // compare previous max
      if (!empty($previous[$name]["max_count"])){
        $result[$name]["max_count"] = max(
          $previous[$name]["max_count"],
          $result[$name]["max_count"]
        );
      }

    }

    return $result;

  }

I realize this is a pretty complex question, and it is just a small piece of a much larger project, so I have tried to break it down as much as possible for this MCVE. I have also prepared a special repository of these files, complete with a phpunit test.

Solution

While your solution works, and pretty efficiently given that it operates in O(n*k) time (where n is the number of nodes in the tree and k is the number of vertices), I figured I'd propose an alternative solution that doesn't rely on arrays or references and is more generalized to work, not just for XML, but for any DOM tree. This solution also operates in O(n*k) time, so it's just as efficient. The only difference is you can consume the values from a generator without having to build out the entire array first.

Modeling The Problem

The easiest way for me to understand this problem is to model it as a graph. If we model the document that way what we get are levels and vertices.

Effectively, this lets us divide and conquer, breaking the problem down into two distinct steps.

  1. Count the distinct child node names under a given vertex as the sum (the vertical)
  2. Find the max of those sums across the horizontal (the level)

This means that if we do a level-order traversal of this tree, we should be able to easily produce the cardinality of node names as the maximum of the sums over all the verticals.

In other words, there is first a cardinality problem of counting the distinct child node names of each node, and then the problem of finding the maximum of those counts across the entire level.
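The two steps can be sketched on plain arrays before any DOM code is involved. This is only an illustration: the `$verticals` input is hypothetical hand-written data standing in for the child names found under each parent on one level.

```php
<?php
// Hypothetical input: the child node names found under each parent
// on a single level (one inner array per parent, i.e. per "vertical").
$verticals = [
    ['SAMPLE'],            // children of the first Record
    ['SAMPLE', 'SAMPLE'],  // children of the second Record
];

$max = [];
foreach ($verticals as $names) {
    // Step 1: sum the occurrences of each name under one parent.
    $sum = array_count_values($names);

    // Step 2: keep the largest sum seen so far across the level.
    foreach ($sum as $name => $count) {
        $max[$name] = max($max[$name] ?? 0, $count);
    }
}

print_r($max); // SAMPLE => 2
```

Here `SAMPLE` ends up with a count of 2 because the largest per-parent sum wins, not the total of 3 across the level.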

Minimal, Complete, Verifiable, Self-Contained Example

So to provide a Minimal, Complete, Verifiable, and Self-Contained Example I'm going to rely on extending PHP's DOMDocument rather than the third party XML library you're using in your example.

It's probably worth noting that this code is not backwards compatible with PHP 5 (because of the use of yield from), so you must use PHP 7 or later for this implementation to work.

First, I'm going to implement a function in DOMDocument that allows us to iterate over the DOM tree in level-order by using a generator.

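class SpecialDOM extends DOMDocument {
    // Yields $level => [child nodes] for every node that has children,
    // descending into the tree recursively.
    public function level(DOMNode $node = null, $level = 0, $ignore = ["#text"]) {
        if (!$node) {
            $node = $this; // start from the document itself
        }
        // Collect this node's children, skipping ignored node names.
        $stack = [];
        if ($node->hasChildNodes()) {
            foreach($node->childNodes as $child) {
                if (!in_array($child->nodeName, $ignore, true)) {
                    $stack[] = $child;
                }
            }
        }
        if ($stack) {
            // Emit the whole group of siblings, keyed by its depth.
            yield $level => $stack;
            // Recurse into each child to emit the groups one level down.
            foreach($stack as $node) {
                yield from $this->level($node, $level + 1, $ignore);
            }
        }
    }
}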

The mechanics of the function itself are actually quite simple. It doesn't rely on passing around arrays or using references; instead it uses the DOMDocument object itself to build a stack of all child nodes of a given node, and then yields that entire stack at once. This is the level part. From there we rely on recursion to yield, from each element in this stack, any other nodes on the next level.
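One detail of this approach is that `yield from` preserves the keys of the inner generator, so the same level number can be yielded more than once. The sketch below shows that behaviour on a nested array rather than a DOM tree; `walk()` and `$tree` are hypothetical names used only for this illustration.

```php
<?php
// Hypothetical helper (not part of the answer): yields depth => label
// pairs for a nested array, recursing with "yield from".
function walk(array $tree, int $depth = 0): Generator
{
    foreach ($tree as $label => $children) {
        yield $depth => $label;
        yield from walk($children, $depth + 1);
    }
}

$tree = ['Data' => ['Record' => ['SAMPLE' => []], 'Note' => []]];

// Iterating with foreach sees every pair, including repeated keys.
$pairs = [];
foreach (walk($tree) as $depth => $label) {
    $pairs[] = "$depth:$label";
}
echo implode(' ', $pairs), "\n"; // 0:Data 1:Record 2:SAMPLE 1:Note

// Collapsing into an array by key keeps only the last value per depth.
$collapsed = iterator_to_array(walk($tree), true);
```

This is why the results are consumed with `foreach` rather than collected with `iterator_to_array()` and preserved keys: later groups on the same level would overwrite earlier ones.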

Here's a very simple XML document to demonstrate how straightforward this is.

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>

<Data>
    <Record>
        <SAMPLE>Some Sample</SAMPLE>
    </Record>
    <Note>
        <SAMPLE>Some Sample</SAMPLE>
    </Note>
    <Record>
        <SAMPLE>Sample 1</SAMPLE>
        <SAMPLE>Sample 2</SAMPLE>
    </Record>
</Data>
XML;

$dom = new SpecialDOM;
$dom->loadXML($xml);

foreach($dom->level() as $level => $stack) {
    echo "- Level $level\n";
    foreach($stack as $item => $node) {
        echo "$item => $node->nodeName\n";
    }
}

The output will look like this.

- Level 0
0 => Data
- Level 1
0 => Record
1 => Note
2 => Record
- Level 2
0 => SAMPLE
- Level 2
0 => SAMPLE
- Level 2
0 => SAMPLE
1 => SAMPLE

So at least now we have a way of knowing what level a node is on and in what order it appears on that level, which is useful for what we intend to do.

Building a nested array is actually unnecessary to obtain the cardinality sought by max_count, because we already have access to the nodes themselves from the DOM tree. That means we know which elements each node contains inside our loop at every iteration. We don't have to generate the entire array at once to begin exploring it. We can work in level order instead, which is actually really cool, because it means you can build a flat array to get the max_count for each entry.

Let me demonstrate how that would work.

$max = [];
foreach($dom->level() as $level => $stack) {
    $sum = [];
    foreach($stack as $item => $node) {
        $name = $node->nodeName;
        // the sum
        if (!isset($sum[$name])) {
            $sum[$name] = 1;
        } else {
            $sum[$name]++;
        }
        // the maximum
        if (!isset($max[$level][$name])) {
            $max[$level][$name] = 1;
        } else {
            $max[$level][$name] = max($sum[$name], $max[$level][$name]);
        }
    }
}

var_dump($max);

The output we get would look like this.

array(3) {
  [0]=>
  array(1) {
    ["Data"]=>
    int(1)
  }
  [1]=>
  array(2) {
    ["Record"]=>
    int(2)
    ["Note"]=>
    int(1)
  }
  [2]=>
  array(1) {
    ["SAMPLE"]=>
    int(2)
  }
}

This shows that we can calculate max_count without the need for references or complex nested arrays. It's also easier to wrap your head around when you obviate the one-way mapping semantics of PHP arrays.

Synopsis

Here's the resulting output from this code on your sample XML document.

array(5) {
  [0]=>
  array(1) {
    ["Data"]=>
    int(1)
  }
  [1]=>
  array(1) {
    ["Record"]=>
    int(2)
  }
  [2]=>
  array(1) {
    ["SAMPLE"]=>
    int(2)
  }
  [3]=>
  array(4) {
    ["TITLE"]=>
    int(1)
    ["SUBTITLE"]=>
    int(1)
    ["AUTH"]=>
    int(2)
    ["ABSTRACT"]=>
    int(1)
  }
  [4]=>
  array(2) {
    ["FNAME"]=>
    int(1)
    ["DISPLAY"]=>
    int(1)
  }
}

This is identical to the max_count of each of your sub-arrays.

  • Level 0
    • Data => max_count 1
  • Level 1
    • Record => max_count 2
  • Level 2
    • SAMPLE => max_count 2
  • Level 3
    • TITLE => max_count 1
    • SUBTITLE => max_count 1
    • AUTH => max_count 2
    • ABSTRACT => max_count 1
  • Level 4
    • FNAME => max_count 1
    • DISPLAY => max_count 1

To get the elements for any of these nodes throughout the loop, just look at $node->childNodes, since you already have the tree (which eliminates the need for references).
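As a small sketch of that lookup, the distinct child element names of a node can be read straight from `childNodes`. The one-line XML string here is a made-up fragment for illustration, and the node type check is an assumption that only element children are of interest.

```php
<?php
$dom = new DOMDocument;
$dom->loadXML('<SAMPLE><TITLE/><AUTH/><AUTH/><ABSTRACT/></SAMPLE>');

// Use the names as array keys so repeated names collapse to one entry.
$names = [];
foreach ($dom->documentElement->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $names[$child->nodeName] = true;
    }
}

$elements = array_keys($names);
print_r($elements); // TITLE, AUTH, ABSTRACT
```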

The only reason you needed to nest the elements into your array is that the keys of a PHP array have to be unique. Since you're using the node name as the key, nesting is required to reach the lower levels of the tree and still structure the value of max_count properly. So it's a data structure problem, and I solve it differently by not modeling the solution after the data structure.
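If the nested array from the question is still wanted, it can be produced from the DOM without references. The following is only a sketch under stated assumptions: `describe()` is a hypothetical helper name, it uses plain DOMDocument, it counts element nodes only, and it groups same-named elements that share the same chain of parent names. The XML string is a shortened stand-in for the sample document.

```php
<?php
// Hypothetical helper: given a list of parent nodes that share a name,
// return [childName => ['max_count' => n, 'elements' => [...]]].
function describe(array $parents): array
{
    $result = [];
    $childrenByName = [];

    foreach ($parents as $parent) {
        // Sum of each child name under this one parent.
        $sum = [];
        foreach ($parent->childNodes as $child) {
            if ($child->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text, comments, and similar nodes
            }
            $name = $child->nodeName;
            $sum[$name] = ($sum[$name] ?? 0) + 1;
            $childrenByName[$name][] = $child;
        }
        // Maximum of those sums across all parents.
        foreach ($sum as $name => $count) {
            $result[$name]['max_count'] = max($result[$name]['max_count'] ?? 0, $count);
        }
    }

    // Recurse once per child name, passing every node with that name.
    foreach ($childrenByName as $name => $children) {
        $elements = describe($children);
        if ($elements) {
            $result[$name]['elements'] = $elements;
        }
    }

    return $result;
}

$dom = new DOMDocument;
$dom->loadXML(
    '<Data><Record><SAMPLE><AUTH/><AUTH/></SAMPLE></Record>'
    . '<Record><SAMPLE><AUTH/></SAMPLE><SAMPLE/></Record></Data>'
);

$result = describe([$dom]);
print_r($result);
```

Because each call returns its own array, the maximum is carried by return values instead of by reference parameters.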
