Parsing lxml.etree._Element contents
Problem description
I have the following element from a <table>:
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
I am trying to extract "5548U Power La Vaca (M8025K) Linux 4.2.x.x" from this element (including the spaces).
import lxml.etree as ET

td_html = """
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""
td_elem = ET.fromstring(td_html)

# Attempt 1: the link text plus the td's own .text
fail_1 = td_elem.find('a').text + td_elem.text
print("FAIL_1", fail_1)

# Attempt 2: the .text of each child element
print("FAIL_2")
for elem in td_elem.iterchildren():
    print(elem.tag, elem.text)
Result
$ python textxml.py
FAIL_1
5548U
FAIL_2
a
5548U
br None
br None
br None
br None
$
Question
It is humbling that I have to ask this question, since it doesn't seem like it should be hard.
How can I extract "Power La Vaca (M8025K) Linux 4.2.x.x" from the td_elem element (including the spaces)?
Please, no regexp solutions.
The explicit solution (using Finn's suggestion of itertext()):
import lxml.etree as ET

td_html = """
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""
td_elem = ET.fromstring(td_html)

# itertext() walks every text and tail fragment in document order
print("SUCCESS", ' '.join([txt.strip() for txt in td_elem.itertext()]))
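One wrinkle with joining every stripped fragment: itertext() also yields the whitespace-only fragments between tags, so the joined string can pick up a stray leading and trailing space. Here is a minimal sketch that drops the blank fragments first. The names fragments and joined are just illustrative, the ampersands in the href are written as &amp; on the assumption that the input has to be well-formed XML for fromstring(), and the fallback import is only there so the snippet also runs where lxml is not installed.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard-library parser supports the same calls used below.
    import xml.etree.ElementTree as ET

td_html = """
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

td_elem = ET.fromstring(td_html)

# itertext() yields each text and tail fragment in document order,
# including the newline-only ones that sit between tags.
fragments = [txt.strip() for txt in td_elem.itertext()]

# Keep only the fragments that still contain something after stripping.
joined = ' '.join(frag for frag in fragments if frag)
print("SUCCESS", joined)
```

With the sample element this should print SUCCESS 5548U Power La Vaca (M8025K) Linux 4.2.x.x, with single spaces between the pieces.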
Recommended answer
I know there must be a better way, but this works.
link = td_elem.find('a').text.strip()
text = ''.join(td_elem.itertext()).strip()
text.split(link)[1]
The output is Power La Vaca(M8025K)Linux 4.2.x.x
Update:
This is actually better if you want spaces in place of those <br>s:
' '.join(map(str, [el.tail for el in td_elem.iterchildren() if el.tail]))
The map(str, ...) isn't actually needed here, but I can imagine other values for which it would be.
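For anyone wondering why the tails are the place to look: in this tree model, text that follows a child's closing tag is stored on that child as .tail rather than on the parent, which is why every br showed None for .text in the question. Below is a rough sketch of the same tail-based idea with blank tails filtered out. The names tails and after_link are made up for illustration, the &amp; escaping is an assumption about what the parser needs, and the fallback import is only a convenience for environments without lxml.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard-library parser supports the same calls used below.
    import xml.etree.ElementTree as ET

td_html = """
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

td_elem = ET.fromstring(td_html)

# Show where each piece of text lives: .text is inside the tag,
# .tail is whatever follows the tag's end up to the next tag.
for child in td_elem:
    print(child.tag, repr(child.text), repr(child.tail))

# Collect only the tails that contain something besides whitespace.
# The link text is in <a>.text, so it is left out automatically.
tails = [child.tail.strip() for child in td_elem
         if child.tail and child.tail.strip()]
after_link = ' '.join(tails)
print(after_link)
```

On the sample element the last line should come out as Power La Vaca (M8025K) Linux 4.2.x.x, without the trailing newline that the final br would otherwise contribute.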