How to split the tags from an HTML tree
Problem description
This is my HTML tree:
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
From this HTML I need to extract the lines that come before each <br> tag:
line 1: Get the IndianOil Citibank Card. Apply Now!
line 2: Get 10X Rewards On Shopping - Save Over 5% On Fuel
How should this be done in Python?
Recommended answer
I think you are asking for the line before each <br/>.
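Some background on how lxml stores text may help here. Text that follows an element's closing tag is kept on that element as .tail, not on the parent, so the text in front of a <br/> lives in the .tail of whatever element precedes it. The snippet below is only a sketch on a made-up fragment (not the markup from the question) showing .text, .tail, and what etree.strip_tags does to them.

```python
from lxml import etree

# Hypothetical fragment, used only to illustrate .text and .tail.
root = etree.fromstring("<p>start<b>bold</b> middle<br/>end</p>")
b, br = root  # the two child elements of <p>

print(repr(root.text))  # text before the first child
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text after </b>, up to the next tag
print(repr(br.tail))    # text after <br/>

# strip_tags drops the <b> tag itself but merges its text and tail
# into the surrounding text, so no characters are lost.
etree.strip_tags(root, "b")
print(repr(root.text))
```

After the strip, the only child left is <br/>, and everything that used to be split around <b> is one contiguous string.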
The following code does this for the sample you provided: it strips out the <b> and <a> tags, then prints the .tail of each element that has a following-sibling <br/>.
from lxml import etree
doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")
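# Remove the <a> and <b> tags but keep their text in place.
etree.strip_tags(doc, 'a', 'b')
# Select every element that has a following <br> sibling and print its tail text.
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))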
Output:
'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n Save Over 5% On Fuel'
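The second result still carries the line break from the source markup. If single-line strings like the ones in the question are wanted, one option is to collapse runs of whitespace. The sketch below assumes that is acceptable; lines_before_br is a hypothetical helper name, and SAMPLE is a shortened stand-in for the question's markup with placeholder href values.

```python
from lxml import etree

# Shortened stand-in for the question's markup (href values are placeholders).
SAMPLE = """<li><h3><a href="#">Title</a></h3>Get the <b>Card</b>. Apply Now!
<br />
<a href="#">Get 10X Rewards On Shopping</a> -
<a href="#">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>"""


def lines_before_br(html):
    """Hypothetical helper: return the text before each <br>, one line each."""
    doc = etree.HTML(html)
    # Drop the <a> and <b> tags but keep their text in the tree.
    etree.strip_tags(doc, "a", "b")
    lines = []
    for element in doc.xpath("//*[following-sibling::br]"):
        # .tail can be None; split()/join collapses newlines and repeated spaces.
        text = " ".join((element.tail or "").split())
        if text:
            lines.append(text)
    return lines


for line in lines_before_br(SAMPLE):
    print(line)
```

Guarding against a missing .tail and skipping empty strings keeps the helper from failing on markup where two <br/> tags sit next to each other.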