Get text content of an HTML element using XPath?

Question

Consider this HTML:

<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20 
    </p>
    <a href="/add">Add to cart</a>
</div>

Using XPath, I want to extract "Monitor $300" and "Keyboard $20". I use this XPath:

 //div[a[contains(., "Add to cart")]]/p/text()

But it selects <span class="abc">Monitor</span> <b>$300</b>. I don't want the tags. How do I get only the text?

Recommended answer

You want to select all descendant text nodes, not just the child text nodes:

//div[a[contains(., "Add to cart")]]/p//text()

Note the double slash between p and text() there.
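
To make the child/descendant distinction concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of XPath (an assumption made for illustration; ElementTree has no text() step, so the two node sets are built by hand). The names child_text and descendant_text are hypothetical. Text directly inside <p> corresponds to p/text(), while itertext() walks every nested element the way p//text() does.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the first <p> from the question, without the indentation.
p = ET.fromstring('<p><span class="abc">Monitor</span> <b>$300</b></p>')

# Child text nodes of <p>: its own leading text plus the "tail" text that
# follows each child element. This is roughly what p/text() would match.
child_text = [t for t in [p.text] + [child.tail for child in p] if t]

# Descendant text nodes: everything itertext() yields, in document order.
# This is roughly what p//text() would match.
descendant_text = list(p.itertext())

print(child_text)
print(descendant_text)
```

Only the space between the two tags is a direct child of <p>; "Monitor" and "$300" live one level deeper, which is why the single-slash form misses them.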

This XPath will potentially also include a lot of inter-tag whitespace, so you'll need to clean that up. Example using lxml:

>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
... <div>
...     <p>
...     <span class="abc">Monitor</span> <b>$300</b>
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... <div>
...     <p>
...     <span class="abc">Keyboard</span> $20 
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n    ', 'Monitor', ' ', '$300', '\n    ', '\n    ', 'Keyboard', ' $20 \n    ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
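
The list above is flat, so the name and the price end up as separate items. If one string per product is wanted instead ("Monitor $300", "Keyboard $20"), one possible approach is to join the descendant text of each <p> and collapse the whitespace. The sketch below is not part of the original answer: it uses the standard-library xml.etree.ElementTree rather than lxml, filters on the "Add to cart" link in plain Python because ElementTree's XPath subset has no contains(), and the helper name product_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = '''<div>
<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20 
    </p>
    <a href="/add">Add to cart</a>
</div>
</div>'''


def product_lines(markup):
    """Return one whitespace-normalized string per <p> in each product <div>."""
    root = ET.fromstring(markup)
    lines = []
    for div in root.iter("div"):
        # Keep only <div> elements with a direct <a> child reading "Add to cart".
        link = div.find("a")
        if link is None or "Add to cart" not in "".join(link.itertext()):
            continue
        for p in div.findall("p"):
            # Join all descendant text, then collapse runs of whitespace.
            lines.append(" ".join("".join(p.itertext()).split()))
    return lines


print(product_lines(HTML))
```

With lxml, the same per-element normalization can also be done inside XPath by evaluating normalize-space(.) on each matched <p> element.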
