Using XmlSlurper: How to select sub-elements while iterating over a GPathResult
Problem description
I am writing an HTML parser, which uses TagSoup to pass a well-formed structure to XmlSlurper.
Here's the generalised code:
def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(htmlText)

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}
I would expect the each to let me select each 'li' in turn so I can retrieve the corresponding href and address details. Instead, I am getting this output:
#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111
I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.
Can you let me know:
- Why I'm getting the output shown
- How I can retrieve the href/address pairs for each 'li' item
Thanks.
Solution
Replace grep with find:
html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}
then you'll get
#href1: Here is the addressTelephone number: telephone
#href2: Here is another addressAnother telephone: 0845 1111111
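This works because grep returns an ArrayList, whereas find returns a single NodeChild. Property access on a list (here .ol.li) is applied to each element and the results are collected into another list, so in the original code each ran only once, over a single GPathResult holding both li elements. You can check the two return types: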
println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()
results in:
class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild
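To make the effect of the two return types on .ol.li visible, here is a minimal sketch. It rests on a few assumptions that are not in the original post: the sample markup is a shortened, already well-formed stand-in (so TagSoup is left out), the variable names viaGrep and viaFind are made up for illustration, and the import assumes Groovy 3 or later.

```groovy
// Assumes Groovy 3 or later; on Groovy 2.x remove this import,
// because groovy.util.XmlSlurper is available by default there.
import groovy.xml.XmlSlurper

// Hypothetical, already well-formed sample, so TagSoup is not needed here.
def xml = '''
<html><body>
  <div class="divclass">
    <ol>
      <li><h3><a href="#href1">one</a></h3></li>
      <li><h3><a href="#href2">two</a></h3></li>
    </ol>
  </div>
</body></html>
'''
def html = new XmlSlurper().parseText(xml)

// grep yields a List of matches; .ol.li is applied to each list element
// and the results are collected into a new List.
def viaGrep = html.'**'.grep { it.@class == 'divclass' }.ol.li

// find yields the first match itself; .ol.li is an ordinary GPath step.
def viaFind = html.'**'.find { it.@class == 'divclass' }.ol.li

println viaGrep.size()     // 1: a List with one entry
println viaGrep[0].size()  // 2: that entry holds both li elements
println viaFind.size()     // 2: each would visit the li elements one by one
```

With grep, each therefore sees one item that contains both li elements, which is why the hrefs and addresses came out concatenated.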
So if you want to keep grep, nest another each so that the inner loop iterates over the individual li elements:
html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}
Long story short, in your case, use find rather than grep.
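If the href/address pairs are needed as data rather than printed output, the find-based approach can be combined with collect. This is only a sketch under assumptions: the markup is a simplified, well-formed stand-in for the original HTML (so TagSoup is omitted), the names div and pairs are hypothetical, and the import assumes Groovy 3 or later.

```groovy
// Assumes Groovy 3 or later; on Groovy 2.x remove this import.
import groovy.xml.XmlSlurper

// Hypothetical, simplified sample markup.
def xml = '''
<html><body>
  <div class="divclass">
    <ol>
      <li><h3><a href="#href1">one</a></h3><address>First address</address></li>
      <li><h3><a href="#href2">two</a></h3><address>Second address</address></li>
    </ol>
  </div>
</body></html>
'''
def html = new XmlSlurper().parseText(xml)

// find returns the first matching node, or null when nothing matches.
def div = html.'**'.find { it.@class == 'divclass' }

// Build one map per li; text() turns each GPath result into a plain String.
def pairs = div.ol.li.collect { li ->
    [link: li.h3.a.@href.text(), address: li.address.text()]
}

pairs.each { println "${it.link}: ${it.address}" }
```

Because find returns null when no element matches, real code may want to guard the next step, for example with div?.ol.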