Groovy Node.depthFirst() returning a List of Nodes and Strings?

Problem description

I'm hoping someone will just point out something obvious that I'm missing here. I feel like I've done this a hundred times and for some reason tonight, the behavior coming from this is throwing me for a loop.

I'm reading in some XML from a public API. I want to extract all the text from a certain node (everything within 'body'), which also includes a variety of child nodes. Simple example:

<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>

So ultimately I want to traverse the tree within the desired node (again, 'body') and extract all the text contained in its natural order. Simple enough, so I just write up this little Groovy script...

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
xml.metadata.article.body[0].depthFirst().each { node ->
    if(node.children().size() == 1) {
        println node.text()
    }   
}

...which proceeds to blow up with "No signature of method: java.lang.String.children()". So I'm thinking to myself "wait, what? Am I going crazy?" Node.depthFirst() should only return a List of Nodes. I add a little 'instanceof' check and sure enough, I'm getting a combination of Node objects and String objects. Specifically, the bare text lines that aren't wrapped in an element of their own are returned as Strings, i.e. "This contains" and "and". Everything else is a Node (as expected).

I can work around this easily. However, this doesn't seem like correct behavior and I'm hoping someone can point me in the right direction.

Solution

I'm pretty sure that's correct behavior (though I've always found the XmlSlurper and XmlParser to have screwy APIs). All things you can iterate through really should implement a node interface IMO and potentially have a type of TEXT that you could use to know to get the text from them.

Those text nodes are valid nodes that in many cases you'd want to hit during a depth-first traversal of the XML. If they didn't get returned, your algorithm of checking for a children size of 1 wouldn't work, because some nodes (like the <p> tag) have both text and elements mixed together underneath them.

What makes things even worse is that depthFirst doesn't return text nodes consistently: where the text is the only child, such as for italic above, the text is not returned as a separate item.
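To see the mix for yourself, a minimal sketch along these lines may help. It assumes Groovy 3 or later, where XmlParser lives in groovy.xml (on older versions, drop the import). The one-line sample document and the variable names are made up for illustration.

```groovy
import groovy.xml.XmlParser  // assumption: Groovy 3+; on 2.x the class is groovy.util.XmlParser and needs no import

// Hypothetical one-line sample with mixed content: text and an element side by side
def p = new XmlParser().parseText('<p>This contains <italic>italics</italic> and more</p>')

// children() holds the raw child list: Strings for text, Nodes for elements
def kinds = p.children().collect { it instanceof Node ? 'Node' : it.getClass().simpleName }
println kinds

// If only elements are wanted, filter before calling Node methods on the items
def elementNames = p.depthFirst().findAll { it instanceof Node }.collect { it.name() }
println elementNames
```

Here kinds should come out as String, Node, String, which is the same mix the question ran into. The instanceof filter is only a guard, so it is harmless whether or not a given traversal hands back Strings.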

I tend to lean on Groovy's method signatures and let the runtime figure out which overload is the right one to handle each item (rather than using something like instanceof), like this:

def rawXml = """<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>"""

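// Text content arrives as a plain String: pass it through unchanged.
def processNode(String nodeText) {
    return nodeText
}

// Everything else is a Node: return its text only when it has exactly one child.
// Otherwise this returns null, which findResults discards.
def processNode(Object node) {
    if (node.children().size() == 1) {
        return node.text()
    }
}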

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
    processNode(node)
}

println xmlText.join(" ")

Prints:

A Title This contains italics and xref's .  Second Title
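If you'd rather not depend on what depthFirst() chooses to return, one option is to walk children() yourself. The following is only a sketch under assumptions: collectText is a hypothetical helper name, the sample XML is a compacted version of the document above, and the import assumes Groovy 3 or later (drop it on older versions).

```groovy
import groovy.xml.XmlParser  // assumption: Groovy 3+; on 2.x the class is groovy.util.XmlParser and needs no import

// Hypothetical helper: gather text in document order by recursing over children()
List<String> collectText(Object item) {
    if (item instanceof String) {
        return [item.trim()]                                   // a text child
    }
    if (item instanceof Node) {
        return item.children().collectMany { collectText(it) } // an element: recurse into its children
    }
    return []                                                  // anything unexpected is ignored
}

def sample = '''<body><sec><title>A Title</title><p>This contains <italic>italics</italic> and <xref ref-type="bibr">xref's</xref>.</p></sec><sec><title>Second Title</title></sec></body>'''

def body = new XmlParser().parseText(sample)

// Drop empty strings, then join with single spaces
def text = collectText(body).findAll { it }.join(' ')
println text
```

Because every String is visited exactly once, no text is skipped or repeated, whatever depthFirst() and text() happen to do.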

Alternatively, the XmlSlurper class probably does more of what you want/expect, and its text() method gives more reasonable output. If you really don't need to do any sort of DOM walking with the results (what XmlParser is "better" for), I'd suggest XmlSlurper:

def xmlParser = new XmlSlurper()
def xml = xmlParser.parseText(rawXml)
def bodyText = xml.metadata.article.body[0].text()
println bodyText

Prints:

A Title
                    This contains 
                    italics 
                    and
                    xref's
                    .
                Second Title
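The leading whitespace in that output comes straight from the indentation of the source XML. If a single line is what you're after, collapsing runs of whitespace is one possible follow-up step. This is a sketch with a made-up, shortened sample document, again assuming Groovy 3 or later for the import:

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; on 2.x the class is groovy.util.XmlSlurper and needs no import

// Hypothetical, shortened sample with indentation similar to the original document
def sample = '''<body>
  <title>A Title</title>
  <p>
    This contains
    <italic>italics</italic>
    and more
  </p>
</body>'''

def body = new XmlSlurper().parseText(sample)

// text() concatenates all nested text; collapse whitespace runs to single spaces
def normalized = body.text().replaceAll(/\s+/, ' ').trim()
println normalized
```

Note that this relies on whitespace already sitting between the pieces of text. Two adjacent elements with nothing between them would have their text run together.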
