Recursively parse custom markup

Question

I have to handle an existing custom markup language (which is ugly, but unfortunately cannot be altered, because I'm handling legacy data and it needs to stay compatible with a legacy app).

I need to parse command "ranges" and, depending on the action taken by the user, either replace these ranges in the data with something else (HTML or LaTeX code) or remove them from the input entirely.

My current solution uses preg_replace_callback() in a loop until there are no matches left, but it is extremely slow for large documents (e.g. ~7 seconds for 394 replacements in a 57 KB document).

Recursive regular expressions don't seem to be flexible enough for this task, as I need access to all matches, including those made inside the recursion.
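
A minimal sketch of that limitation follows. The input string and the pattern are assumptions made up for illustration (they are not the production markup or regex); the point is only that a recursive pattern reports the outermost range, while the nested range is not exposed as a match of its own.

```php
<?php
// Illustrative input and pattern; both are assumptions, not the real markup.
$input = 'a begin-command b begin-command c command-end d command-end e';

// (?R) recurses into the whole pattern to consume nested ranges.
$pattern = '~begin-command((?:(?!begin-command|command-end).|(?R))*)command-end~s';

preg_match_all($pattern, $input, $m);

// Only the outermost range is reported as a match ...
echo count($m[0]), "\n";
// ... and the nested range appears merely as raw text inside group 1.
echo trim($m[1][0]), "\n";
```

To process the inner range, the callback would have to run the pattern again on the captured content, which is essentially the repeated-pass approach described above.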

Question: how can I improve the parsing performance?

The regular expressions may be removed completely. They are not a requirement, just the only thing I could come up with.

Note: The code example below is heavily reduced (SSCCE). In reality there are many different "types" of ranges, and the closure does different things depending on the mode of operation (insert values from the DB, remove entire ranges, convert to another format, etc.). Please keep this in mind!

Example of what I'm currently doing:

<?php
$data = <<<EOF
some text 1
begin-command
    some text 2
    begin-command
        some text 3
    command-end
    some text 4
    begin-command-if "%VAR%" == "value"
        some text 5
        begin-command
            some text 6
        command-end
    command-end
command-end

EOF;

$regex = '~
    # opening tag
    begin-(?P<type>command(?:-if)?)
    # must not contain a nested "command" or "command-if" command!
    (?!.*begin-command(?:-if)?.*command(?:-if)?-end)
    # the parameters for "command-if" are optional
    (?:
        [\s\n]*?
        (?:")[\s\n]*(?P<leftvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*
        # the operator is optional
        (?P<operator>[=<>!]*)
        [\s\n]*
        (?:")[\s\n]*(?P<rightvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*?
    )?
    # the real content
    (?P<content>.*?)
    # closing tag
    command(?:-if)?-end
 ~smx';

$counter = 0;
$loop_replace = true;
while ($loop_replace) {
    // capture $counter by reference so increments persist across callbacks
    $data = preg_replace_callback($regex, function ($matches) use (&$counter) {
        $counter++;
        return "<command id='{$counter}'>{$matches['content']}</command>";
    }, $data, -1, $loop_replace);
}
echo $data;

Recommended answer

I've now completely removed the regular expressions for parsing. I realized that the raw input can actually be seen as an XML markup tree in a somewhat odd representation.

Instead of using regular expressions, I now do the following:

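  1. Replace everything that could be interpreted as XML with its textual representation (using XML entities)
  2. Replace all begin-command ... command-end pairs with corresponding XML tags
    (note that there are actually several different commands)
  3. Let a real parser (XML DOM) handle the markup tree
  4. Recursively traverse the DOM
  5. For each node, perform the appropriate action depending on the mode of operation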

This seems ugly, but I really didn't want to write my own parser; that seemed like overkill in the limited time I have for improving the speed. And oh boy, it is blazing fast, much faster than the regex solution. That is impressive when you consider the overhead of converting the raw input to valid XML and back.

By "blazing fast" I mean that it now takes a mere ~200 ms for a document that previously needed 5-7 seconds to parse with several regular expressions.

Here is the code I'm using now:

// convert raw input to valid XML representation
// ('&' must be escaped first, otherwise the '&' of '&lt;' and '&gt;' is escaped again)
$data = str_replace(
    array('&', '<', '>'),
    array('&amp;', '&lt;', '&gt;'),
    $data
);
$data = preg_replace(
    '!begin-(command|othercommand|morecommand)(?:-(?P<options>\S+))?!',
    '<\1 options="\2">',
    $data
);
$data = preg_replace(
    '!(command|othercommand|morecommand)-end!',
    '</\1>',
    $data
);

// use DOM to parse XML representation
$dom = new \DOMDocument();
$dom->loadXML("<?xml version='1.0' ?>\n<document>".$data.'</document>');
$xpath = new \DOMXPath($dom);

// iterate over DOM, recursively replace commands with conversion results
// (select the children of <document>, not the wrapper element itself)
foreach($xpath->query('/document/*') as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE)
        convertNode($node, 'form', $dom, $xpath);
}

// convert XML DOM back to raw format
$data = $dom->saveXML();
$data = substr($data, strpos($data, "<document>")+10, -12);
// ('&amp;' must be unescaped last, so a literal "&lt;" in the input survives)
$data = str_replace(
    array('&lt;', '&gt;', '&amp;'),
    array('<', '>', '&'),
    $data
);

// output the stuff
echo $data;

function convertNode (\DomNode $node, $output_mode, $dom, $xpath) {
    $type = $node->tagName;
    $children = $xpath->query('./*', $node);

    // recurse over child nodes
    foreach ($children as $childNode) {
        if ($childNode->nodeType == XML_ELEMENT_NODE) {
            convertNode($childNode, $output_mode, $dom, $xpath);
        }
    }

    // in production code, here is actual logic
    // to process the several command types
    $newNode = $dom->createTextNode(
        "<$type>" 
        . $node->textContent
        . "</$type>"
    );

    // replace node with command result
    if ($node->parentNode) {
        $node->parentNode->replaceChild($newNode, $node);
        // just to be sure - normalize parent node
        $newNode->parentNode->normalize();
    } 
}
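
The placeholder comment in convertNode() stands in for the per-command logic. As a rough sketch of how a mode-dependent handler could plug into the same approach, the following self-contained variant may help. All names in it (toXml, renderCommand, convertTree, convertMarkup, and the mode values 'html' and 'strip') are hypothetical and chosen for illustration only; it handles a single command type and reads the result back through textContent instead of saveXML().

```php
<?php
// Hypothetical sketch: function names and mode values are illustrative only.

/** Turn the raw markup into XML; '&' is escaped first to avoid double escaping. */
function toXml(string $raw): string
{
    $xml = str_replace(['&', '<', '>'], ['&amp;', '&lt;', '&gt;'], $raw);
    $xml = preg_replace('!begin-(command)(?:-(\S+))?!', '<$1 options="$2">', $xml);
    return preg_replace('!(command)-end!', '</$1>', $xml);
}

/** Produce the replacement text for one command, depending on the mode. */
function renderCommand(DOMElement $node, string $mode): string
{
    if ($mode === 'strip') {
        return ''; // remove the whole range
    }
    // nested commands were already replaced by text at this point
    return '<div class="' . $node->tagName . '">' . $node->textContent . '</div>';
}

/** Depth-first: convert children, then replace the node itself by a text node. */
function convertTree(DOMElement $node, string $mode, DOMDocument $dom): void
{
    // copy the child list first, because the loop modifies the tree
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement) {
            convertTree($child, $mode, $dom);
        }
    }
    $text = $dom->createTextNode(renderCommand($node, $mode));
    $node->parentNode->replaceChild($text, $node);
}

function convertMarkup(string $raw, string $mode): string
{
    $dom = new DOMDocument();
    $dom->loadXML('<document>' . toXml($raw) . '</document>');
    foreach (iterator_to_array($dom->documentElement->childNodes) as $child) {
        if ($child instanceof DOMElement) {
            convertTree($child, $mode, $dom);
        }
    }
    // textContent already contains decoded text, so no manual unescaping is needed
    return $dom->documentElement->textContent;
}

echo convertMarkup('a begin-command b begin-command c command-end d command-end e', 'html');
```

Because children are converted before their parent, each handler only ever sees plain text, which keeps the per-command logic independent of nesting depth.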
