Using XmlSlurper: How to select sub-elements while iterating over a GPathResult

Question

    I am writing an HTML parser, which uses TagSoup to pass a well-formed structure to XMLSlurper.

    Here's the generalised code:

    def htmlText = """
    <html>
    <body>
    <div id="divId" class="divclass">
    <h2>Heading 2</h2>
    <ol>
    <li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
    <li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
    </ol>
    </div>
    </body>
    </html>
    """     
    
    def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );
    
    html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
    

    I would expect 'each' to hand me each 'li' in turn so that I can retrieve the corresponding href and address details. Instead, I am getting this output:

    #href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111
    

    I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.

    Can you let me know:

    • Why I'm getting the output shown
    • How I can retrieve the href/address pairs for each 'li' item

    Thanks.

    Solution

    Replace grep with find:

    html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
    

    then you'll get

    #href1: Here is the addressTelephone number: telephone
    
    #href2: Here is another addressAnother telephone: 0845 1111111
    

    The difference is the return type: grep returns an ArrayList, whereas find returns a single NodeChild:

    println html.'**'.grep { it.@class == 'divclass' }.getClass()
    println html.'**'.find { it.@class == 'divclass' }.getClass()
    

    results in:

    class java.util.ArrayList
    class groovy.util.slurpersupport.NodeChild
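
To see how that return type changes what 'each' receives, here is a minimal sketch. It assumes a cut-down, well-formed stand-in for the page (so TagSoup is not needed) and Groovy 3 or later for the import; the variable names viaGrep, viaFind, grepHrefs and findHrefs are illustrative only.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x XmlSlurper lives in groovy.util and needs no import

// Assumption: simplified, well-formed markup standing in for the TagSoup-parsed page.
def xml = '''
<html><body>
  <div class="divclass">
    <ol>
      <li><h3><a href="#href1">one</a></h3><address>first</address></li>
      <li><h3><a href="#href2">two</a></h3><address>second</address></li>
    </ol>
  </div>
</body></html>
'''
def html = new XmlSlurper().parseText(xml)

// grep yields a plain List, so .ol and .li are applied to every list element
// and the results are gathered into another List (here: one entry holding both li nodes).
def viaGrep = html.'**'.grep { it.@class == 'divclass' }.ol.li

// find yields the single matching node, so .ol.li is ordinary GPath navigation.
def viaFind = html.'**'.find { it.@class == 'divclass' }.ol.li

def grepHrefs = []
viaGrep.each { grepHrefs << it.h3.a.@href.text() }   // closure runs once, on both li nodes together

def findHrefs = []
viaFind.each { findHrefs << it.h3.a.@href.text() }   // closure runs once per li

println grepHrefs   // prints [#href1#href2]
println findHrefs   // prints [#href1, #href2]
```

The grep variant invokes the closure only once, which matches the concatenated output shown in the question.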
    

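    Because grep returns a list, .ol.li is applied to each element of that list and the results are collected into another list. Its single entry holds both 'li' nodes, so 'each' runs only once and the hrefs and addresses come out concatenated. If you want to keep grep, nest another 'each' to visit the 'li' nodes individually: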

    html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
        it.each { linkItem ->
            def link = linkItem.h3.a.@href
            def address = linkItem.address.text()
            println "$link: $address\n"
        }
    }
    

    Long story short, in your case, use find rather than grep.
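
If the href/address pairs are needed as data rather than printed output, the find-based approach could be extended roughly as follows. This is only a sketch under assumptions: the markup is a simplified stand-in for the parsed page, the import targets Groovy 3 or later, and the names div and pairs are made up for illustration. It also guards against find returning null when no element matches.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x XmlSlurper lives in groovy.util and needs no import

// Assumption: simplified, well-formed markup; the real page would go through TagSoup as in the question.
def xml = '''
<html><body>
  <div class="divclass">
    <ol>
      <li><h3><a href="#href1">one</a></h3><address>first</address></li>
      <li><h3><a href="#href2">two</a></h3><address>second</address></li>
    </ol>
  </div>
</body></html>
'''
def html = new XmlSlurper().parseText(xml)

// find returns null when nothing matches, so check before navigating further.
def div = html.'**'.find { it.@class == 'divclass' }

// Build one map per li; text() turns the attribute and element into plain Strings.
def pairs = (div == null) ? [] : div.ol.li.collect { li ->
    [href: li.h3.a.@href.text(), address: li.address.text()]
}

pairs.each { println "${it.href}: ${it.address}" }
// prints:
// #href1: first
// #href2: second
```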
