lxml classic: Get text content except for that of nested tags?

Question

This must be an absolute classic, but I can't find the answer here. I'm parsing the following tag with lxml cssselect:

<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>

I want to get the content of the <li> tag without the content of the <span> tag.

Currently, I have:

stop_list = doc.cssselect('ol#stations li a')
start = stop_list[0].text_content().strip()

But that gives me 3 Detroit. How can I just get Detroit?

Recommended answer

The itertext method of an element returns an iterator over the node's text data, in document order. For your <a> tag, ' Detroit' would be the second value returned by the iterator (the first is '3', the text of the nested <span>). If the structure of your document always conforms to a known specification, you can skip specific text elements to get what you need.

from lxml import html

doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
stop_nodes = doc.cssselect('li a')
stop_names = []
for stop in stop_nodes:
    node_text = stop.itertext()
    next(node_text)  # Skip '3', the text of the nested <span>
    stop_names.append(next(node_text).lstrip())  # ' Detroit' -> 'Detroit'
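To see why the second value is the one you want, it can help to print what itertext yields next to the text and tail attributes of the elements involved. This is only a small sketch on the <a> tag from the question; the variable names are illustrative.

```python
from lxml import html

# Parse just the <a> element from the question.
a = html.fromstring('<a href="/stations/1"><span class="num">3</span> Detroit</a>')
span = a[0]  # the nested <span class="num">

# itertext() walks the subtree and yields every text chunk in document order.
chunks = list(a.itertext())
print(chunks)

# lxml stores text that follows a child element in that child's .tail,
# so ' Detroit' lives on the <span>, not on the <a> itself.
print(repr(a.text))     # text before the first child
print(repr(span.text))  # text inside the <span>
print(repr(span.tail))  # text after </span>, still inside <a>
```

Here a.text is None because the <a> tag starts directly with the <span>, and the station name is the tail of the <span>.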

If you're more comfortable with CSS selectors than with XPath, you can combine a CSS selector with the XPath text() function mentioned in Zachary's answer. Note that xpath('text()') returns a list of the element's direct text nodes, so join them before stripping:

stop_names = [''.join(a.xpath('text()')).strip() for a in doc.cssselect('li a')]
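If the number of nested tags varies from item to item, skipping a fixed number of text chunks becomes fragile. A possible alternative is to collect only the text that belongs directly to the element: its own text plus the tail of each child. The sketch below assumes a hypothetical station list modelled on the question; the own_text helper and the sample markup are illustrative and not part of lxml. It uses an XPath expression instead of cssselect so that it runs with lxml alone.

```python
from lxml import html

# Hypothetical sample markup, modelled on the snippet in the question.
SAMPLE = """
<ol id="stations">
  <li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>
  <li><a href="/stations/2"><span class="num">7</span> Ann <b>Arbor</b> East</a></li>
  <li><a href="/stations/3">Chicago</a></li>
</ol>
"""


def own_text(element):
    """Return the text directly inside element, skipping nested tags."""
    # element.text is the text before the first child; each child's .tail
    # is the text that follows that child inside the parent.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(parts).split())


doc = html.fromstring(SAMPLE)
stop_names = [own_text(a) for a in doc.xpath('//ol[@id="stations"]//li//a')]
print(stop_names)
```

Because the helper never looks inside child elements, the contents of both <span> and <b> are dropped, while a link with no nested tags is returned unchanged.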
