Parsing lxml.etree._Element contents


Problem description

I have the following element from a <table>:

<td align="center" valign="top">
  <a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST" 
    target="_blank">
    5548U
  </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>

I am trying to extract "5548U Power La Vaca (M8025K) Linux 4.2.x.x" from this element (including the spaces).

import lxml.etree as ET
td_html = """
<td align="center" valign="top">
  <a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST" 
    target="_blank">
    5548U
  </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

td_elem = ET.fromstring(td_html)

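# FAIL_1: .text only holds the text before an element's first child,
# so the text after each <br/> is missing from both values.
fail_1 = td_elem.find('a').text + td_elem.text
print("FAIL_1", fail_1)

# FAIL_2: <br/> elements have no .text; the text that follows them is in .tail.
print("FAIL_2")
for elem in td_elem.iterchildren():
    print(elem.tag, elem.text)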

Result

$ python textxml.py

FAIL_1
    5548U


FAIL_2
a
    5548U

br None
br None
br None
br None
$

Question

It is humbling that I have to ask this question, since it doesn't seem like it should be hard.

How can I extract "Power La Vaca (M8025K) Linux 4.2.x.x" from the td_elem element (including the spaces)?

Please, no regexp solutions.

The explicit solution (using Finn's suggestion of itertext()):

import lxml.etree as ET
td_html = """
<td align="center" valign="top">
  <a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST" 
    target="_blank">
    5548U
  </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

td_elem = ET.fromstring(td_html)
# Skip whitespace-only pieces so the result has no leading or trailing space
print("SUCCESS", ' '.join(txt.strip() for txt in td_elem.itertext() if txt.strip()))

Recommended answer

I know there must be a better way, but this works.

link = td_elem.find('a').text.strip()
text = ''.join(td_elem.itertext()).strip()
text.split(link)[1].strip()  # drop the link text, then the whitespace left behind

The output is Power La Vaca(M8025K)Linux 4.2.x.x
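This works because of how lxml stores mixed content: text that follows a child's closing tag is kept in that child's .tail, not in the parent's .text. The sketch below lists the text and tail of each direct child of the question's element, in Python 3. As a convenience that is not part of the original answer, it falls back to the standard library's xml.etree.ElementTree if lxml is not installed, since both expose .text and .tail the same way for this markup.

```python
try:
    import lxml.etree as ET
except ImportError:  # assumption: the stdlib parser is an acceptable stand-in here
    import xml.etree.ElementTree as ET

td_html = """
<td align="center" valign="top">
  <a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
    target="_blank">
    5548U
  </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

td_elem = ET.fromstring(td_html)

# Collect (tag, text, tail) for each direct child of <td>
parts = [(el.tag, el.text, el.tail) for el in td_elem]
for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))
```

The <a> element carries the link text in .text and has no tail, while each <br/> has no .text and carries the text that follows it in .tail. itertext() walks both, which is why joining its output recovers the whole string.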

Update: This is actually better if you want spaces in place of those <br>s

' '.join(map(str, [el.tail for el in td_elem.iterchildren() if el.tail]))

The map(str, ...) isn't actually needed here, but I can imagine other values for which it would be.
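One caveat with the one-liner: the last <br/> has a tail of "\n", which is truthy, so the joined string ends with a space and a newline. Below is a minimal sketch of a variant that strips each tail and skips the empty ones. tails_joined is a hypothetical helper name, not part of lxml, and the fallback import is only a convenience for machines without lxml.

```python
try:
    import lxml.etree as ET
except ImportError:  # assumption: the stdlib parser is an acceptable stand-in here
    import xml.etree.ElementTree as ET

td_html = """
<td align="center" valign="top">
  <a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST"
    target="_blank">
    5548U
  </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""


def tails_joined(parent, sep=" "):
    """Join the stripped, non-empty tails of parent's direct children."""
    pieces = (el.tail.strip() for el in parent if el.tail)
    return sep.join(piece for piece in pieces if piece)


td_elem = ET.fromstring(td_html)
result = tails_joined(td_elem)
print(repr(result))
```

Because only tails are collected, the link text inside <a> is left out, which matches what the question asks for.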
