Groovy Node.depthFirst() returning a List of Nodes and Strings?

Question

I'm hoping someone will just point out something obvious that I'm missing here. I feel like I've done this a hundred times and for some reason tonight, the behavior coming from this is throwing me for a loop.

I'm reading in some XML from a public API. I want to extract all the text from a certain node (everything within 'body'), which also includes a variety of child nodes. Simple example:

<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>

So ultimately I want to traverse the tree within the desired node (again, 'body') and extract all the text contained in its natural order. Simple enough, so I just write up this little Groovy script...

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
xml.metadata.article.body[0].depthFirst().each { node ->
    if(node.children().size() == 1) {
        println node.text()
    }   
}

...which proceeds to blow up with "No signature of method: java.lang.String.children()". So I'm thinking to myself, "wait, what? Am I going crazy?" Node.depthFirst() should only return a List of Nodes. I add a little instanceof check and sure enough, I'm getting a combination of Node objects and String objects. Specifically, the text lines inside <p> that are not wrapped in an element of their own, i.e. "This contains" and "and", are returned as Strings. Everything else is a Node (as expected).
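The instanceof check described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the variable names and the cut-down sample are made up, the import assumes Groovy 3 or later, and trimWhitespace is set explicitly rather than relying on the parser default.

```groovy
import groovy.xml.XmlParser   // assumption: Groovy 3+; on Groovy 2.x drop this line (XmlParser is auto-imported from groovy.util)

// Cut-down version of the <p> element from the question (illustrative only)
def sample = '''<p>
    This contains
    <italic>italics</italic>
    and
</p>'''

def parser = new XmlParser()
parser.trimWhitespace = true   // set explicitly so indentation does not end up in the text
def p = parser.parseText(sample)

// Split the traversal result by runtime type
def items = p.depthFirst()
def nodes = items.findAll { it instanceof Node }
def strings = items.findAll { it instanceof String }

println "nodes:   ${nodes*.name()}"
println "strings: ${strings}"
```

Both lists should come back non-empty, which is the mix of Node and String objects that trips up the call to children().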

I can work around this easily. However, this doesn't seem like correct behavior and I'm hoping someone can point me in the right direction.

Recommended answer

I'm pretty sure that's correct behavior (though I've always found the XmlSlurper and XmlParser to have screwy APIs). All things you can iterate through really should implement a node interface IMO and potentially have a type of TEXT that you could use to know to get the text from them.

Those text nodes are valid nodes that in many cases you'd want to hit during a depth-first traversal of the XML. If they weren't returned, your check for a children size of 1 wouldn't work, because some nodes (like the <p> tag) have mixed content: both text and elements underneath them.

Also, the fact that depthFirst doesn't consistently return text nodes where the text is the only child, such as for italic above, makes things even worse.

I tend to like to use the signature of groovy methods to let the runtime figure out which is the right way to handle each node (rather than using something like instanceof) like this:

def rawXml = """<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>"""

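// Text nodes arrive as plain Strings: return them unchanged.
def processNode(String nodeText) {
    return nodeText
}

// Everything else is treated as a Node: return its text only when it has a
// single child; otherwise fall through and return null, which findResults drops.
def processNode(Object node) {
    if (node.children().size() == 1) {
        return node.text()
    }
}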

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
    processNode(node)
}

println xmlText.join(" ")

Prints:

A Title This contains italics and xref's .  Second Title
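If you'd rather not depend on which strings depthFirst() decides to hand back, another option is to walk children() yourself. The sketch below is only an illustration under a few assumptions: collectText is a hypothetical helper (not part of Groovy), the import assumes Groovy 3 or later, and trimWhitespace is set explicitly so that indentation is dropped.

```groovy
import groovy.xml.XmlParser   // assumption: Groovy 3+; on Groovy 2.x drop this line

// Hypothetical helper: gathers every text fragment below an item in document order.
List<String> collectText(Object item) {
    if (item instanceof String) return [item]   // text fragment
    if (item instanceof Node) return item.children().collectMany { collectText(it) }   // recurse into children
    return []                                   // ignore anything else
}

def sampleXml = '''<body>
    <sec>
        <title>A Title</title>
        <p>
            This contains
            <italic>italics</italic>
            and
            <xref ref-type="bibr">xref's</xref>
            .
        </p>
    </sec>
    <sec>
        <title>Second Title</title>
    </sec>
</body>'''

def parser = new XmlParser()
parser.trimWhitespace = true   // set explicitly so whitespace-only text is skipped
def body = parser.parseText(sampleXml)

def text = collectText(body).join(' ')
println text
```

Since every String found under a Node is collected exactly once, this should give the same words as the output above, separated by single spaces, without the children().size() == 1 check.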

Alternatively, the XmlSlurper class probably does more what you want/expect it to and has a more reasonable set of output from the text() method. If you really don't need to do any sort of DOM walking with the results (what XmlParser is "better" for), I'd suggest XmlSlurper:

def xmlParser = new XmlSlurper()
def xml = xmlParser.parseText(rawXml)
def bodyText = xml.metadata.article.body[0].text()
println bodyText

Prints:

A Title
                    This contains 
                    italics 
                    and
                    xref's
                    .
                Second Title
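The text() result keeps the indentation and line breaks from the source document. If you want it on a single line, collapsing runs of whitespace is one way to do it. This is a small sketch using plain String methods; the literal below is just a stand-in for the bodyText value above, and the name normalized is made up for the example.

```groovy
// Stand-in for the string returned by text() in the XmlSlurper example
def bodyText = '''A Title
                    This contains
                    italics
                    and
                    xref's
                    .
                Second Title'''

// Trim the ends, then replace every run of whitespace (including newlines) with one space
def normalized = bodyText.trim().replaceAll(/\s+/, ' ')
println normalized
```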
