Webharvest If and null test
Question
I'm trying to make my program check the result of an XPath expression and, if it is null, try a different expression. How do I do this? I have tried all the examples on the website, and the ones that use blank single quotes will not compile.
<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
        <html-to-xml>
            <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
        </html-to-xml>
    </xpath>
</var-def>
<var-def name="productTruth">
    <case>
        <if condition="${googleResults != null}">
            <var name="googleResults"/>
        </if>
        <else>
            <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                <html-to-xml>
                    <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
                </html-to-xml>
            </xpath>
        </else>
    </case>
</var-def>
Also, is there any way to manipulate a defined variable to exclude certain parts of its string value, such as symbols and numbers?
Recommended answer
I ran into the same problem: the example from the official Web-Harvest user manual does not work because of the two adjacent single quotes.
As a workaround, I use: variable.toString().length() > 0
Here is your code:
<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
        <html-to-xml>
            <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
        </html-to-xml>
    </xpath>
</var-def>
<var-def name="productTruth">
    <case>
        <!-- Use the first result only if it produced some text -->
        <if condition="${googleResults.toString().length() > 0}">
            <var name="googleResults"/>
        </if>
        <!-- Otherwise fall back to the second XPath expression -->
        <else>
            <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                <html-to-xml>
                    <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
                </html-to-xml>
            </xpath>
        </else>
    </case>
</var-def>
Also, a few notes on your code in general:
1) Downloading the page is the most time- and memory-consuming part of a Web-Harvest run. If the first XPath does not find the information you want, you end up downloading the page again (re-running the HTTP request). Save the result of the HTTP request in a variable instead; you can then query it again without repeating the download. This also limits the number of times you hit the source server, which becomes an issue when you have multiple pages to scrape.
<!-- Download and convert the page once, then reuse it -->
<var-def name="pagetext">
    <html-to-xml>
        <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
    </html-to-xml>
</var-def>
<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/div/text()">
        <var name="pagetext"/>
    </xpath>
</var-def>
<var-def name="productTruth">
    <case>
        <if condition="${googleResults.toString().length() > 0}">
            <var name="googleResults"/>
        </if>
        <else>
            <!-- Second query runs against the saved page, not a new download -->
            <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/text()">
                <var name="pagetext"/>
            </xpath>
        </else>
    </case>
</var-def>
2) You can avoid the conditional altogether by changing the XPath so that it selects the text nodes under b at any depth, whether or not they are wrapped in a div:
//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()
<var-def name="pagetext">
    <html-to-xml>
        <http url="http://google.com/shopping?q=asus laptops&amp;hl=en"/>
    </html-to-xml>
</var-def>
<var-def name="googleResults">
    <xpath expression="//div[@id='center_col']//div[@id='search']//div[@id='ires']//ol/li/div//b/descendant-or-self::text()">
        <var name="pagetext"/>
    </xpath>
</var-def>
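The question also asks about stripping symbols and numbers from a variable, which the notes above do not cover. The following is only a rough sketch, not a tested configuration. It assumes that Java String methods such as replaceAll can be called inside a ${...} expression in the same way as toString().length() in the condition above. The variable name cleanResults and the character class are hypothetical choices for illustration.

```xml
<!-- Sketch only: cleanResults is a hypothetical variable name. -->
<!-- Assumes googleResults has already been defined earlier in the config. -->
<var-def name="cleanResults">
    <!-- Assumption: String.replaceAll is callable inside ${...}, like
         toString().length() is in the if condition. The pattern removes
         every character that is not an ASCII letter or a space. -->
    <template>${googleResults.toString().replaceAll("[^A-Za-z ]", "")}</template>
</var-def>
```

This treats the whole value as a single string, so if the XPath returns several items they are cleaned together rather than one by one. Web-Harvest also provides a regexp processor, which may be a better fit if you need to clean each item separately or capture groups.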