lxml classic: Get text content except for that of nested tags?
Problem description
This must be an absolute classic, but I can't find the answer here. I'm parsing the following tag with lxml cssselect:
<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>
I want to get the content of the <li> tag without the content of the <span> tag.
Currently, I have:
stop_list = doc.cssselect('ol#stations li a')
start = stop_list[0].text_content().strip()
But that gives me "3 Detroit". How can I get just "Detroit"?
Recommended answer
The itertext method of an element returns an iterator over the node's text data. For your <a> tag, ' Detroit' would be the 2nd value returned by the iterator. If the structure of your document always conforms to a known specification, you could skip specific text elements to get what you need.
from lxml import html

doc = html.fromstring("""<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>""")
stop_nodes = doc.cssselect('li a')
stop_names = []
for start in stop_nodes:
    node_text = start.itertext()
    next(node_text)  # Skip '3', the text of the nested <span>
    stop_names.append(next(node_text).lstrip())
If you're more comfortable with CSS selectors than with XPath, you can combine a CSS selector with the XPath text() function mentioned in Zachary's answer, like this:
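# xpath('text()') returns a list of the element's direct text nodes, so join them before stripping
stop_names = [''.join(a.xpath('text()')).lstrip() for a in doc.cssselect('li a')]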
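Another way to look at the same problem: in the ElementTree-style API that lxml follows, text sitting directly inside an element is stored in that element's .text attribute and in the .tail attribute of each child. The sketch below is only an illustration under that assumption. The helper name own_text is hypothetical (it is not part of lxml), and the fallback import is there only so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree as ET           # preferred when lxml is available
except ImportError:
    import xml.etree.ElementTree as ET     # stdlib fallback with the same .text/.tail API


def own_text(element):
    """Hypothetical helper: text directly inside `element`, skipping nested tags."""
    parts = [element.text or ""]            # text before the first child (may be None)
    for child in element:
        parts.append(child.tail or "")      # text that follows each child's closing tag
    return "".join(parts)


li = ET.fromstring('<li><a href="/stations/1"><span class="num">3</span> Detroit</a></li>')
a = li.find("a")
name = own_text(a).strip()
print(name)  # prints Detroit
```

Unlike skipping a fixed number of itertext() values, this does not depend on how many nested tags come before the text you want.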