scrapy HtmlXPathSelector: determine xpath by searching for keyword


Question

I have a fragment of HTML like the one below:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to get the string "The Keyword: The text".

I know that I can get the XPath of the above HTML using Chrome's inspector or Firebug in Firefox, call hxs.select(xpath).extract(), and then strip the HTML tags to get the string. However, this approach is not generic enough, since the XPath is not consistent across pages.

Hence, I'm thinking of the following approach. First, search for "The Keyword:" using:

from scrapy.selector import HtmlXPathSelector  # legacy Scrapy selector API

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get a match:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

My question is how to get the wanted string, "The Keyword: The text". I am trying to work out how to determine the XPath; once the XPath is known, I can of course get the wanted string.

I am open to any solution other than scrapy's HtmlXPathSelector (e.g. lxml.html might have more features, but I am very new to it).

Thanks.

Recommended answer

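If I understand your question correctly, the "following-sibling" axis is what you are looking for. Starting from the element that contains the keyword, it selects the sibling span that follows it and then the text of the link inside: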

 //*[contains(text(), "The Keyword:")]/following-sibling::span/a/text()

See also: XPath axes
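
Since the question is open to lxml.html, here is a minimal sketch of the same idea with lxml instead of HtmlXPathSelector. It is an illustration under assumptions, not the answerer's code: the helper name keyword_with_value and the SNIPPET constant are made up for this example, the markup is the fragment from the question wrapped in a ul element, and the span/a structure after the label is assumed to be the same on every page.

```python
from lxml import html

# The fragment from the question, wrapped in a <ul> so it parses as one element.
SNIPPET = (
    '<ul><li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li></ul>'
)


def keyword_with_value(tree, keyword):
    """Return 'label value' for the first element whose text contains keyword.

    Hypothetical helper: assumes the value sits in span/a right after the label.
    """
    # Step 1: locate the element by its text, wherever it is in the page.
    # $kw is an XPath variable, filled in from the kw keyword argument.
    labels = tree.xpath('//*[contains(text(), $kw)]', kw=keyword)
    if not labels:
        return None
    label = labels[0]

    # Step 2: walk from the label to its following <span> sibling and
    # read the text of the link inside it.
    values = label.xpath('following-sibling::span/a/text()')
    if not values:
        return None

    return '{} {}'.format(label.text_content().strip(), values[0].strip())


tree = html.fromstring(SNIPPET)
result = keyword_with_value(tree, "The Keyword:")
print(result)
```

Splitting the lookup into two steps keeps both the label text and the link text available, so the combined string can be built without stripping HTML tags by hand.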
