How to split the tags from html tree


Question

This is my HTML tree:

 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>

From this HTML I need to extract the lines before each <br> tag:

line1: Get the IndianOil Citibank Card. Apply Now!

line2: Get 10X Rewards On Shopping - Save Over 5% On Fuel

How should this be done in Python?

Recommended answer

I think you are asking for the line before each <br/>.

The following code does that for the sample you've provided. It strips out the <b> and <a> tags (keeping their text), then prints the .tail of each element that has a <br/> among its following siblings.

from lxml import etree

doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")

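# Remove the <a> and <b> tags but keep their text in place.
etree.strip_tags(doc,'a','b')

# Select every element that has a <br> among its following siblings.
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    # .tail is the text between this element's end tag and the next element.
    print(repr(element.tail.strip()))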

This yields:

'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
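
If lxml is not available, roughly the same result can be sketched with the standard library's html.parser instead. This is a different technique from the .tail approach above, not a drop-in equivalent: it collects the text seen since the previous <br> and emits it when the next <br> arrives. The names LinesBeforeBr, SKIP and lines_before_br are hypothetical, and skipping the text inside <h3> and <cite> is an assumption made to match the sample, not a general rule.

```python
from html.parser import HTMLParser


class LinesBeforeBr(HTMLParser):
    """Collect the text that precedes each <br> tag."""

    # Assumption for this sample: text inside these tags is not part of a line.
    SKIP = {"h3", "cite"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []        # finished lines, one per <br>
        self._buf = []         # text pieces seen since the last <br>
        self._skip_depth = 0   # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        # <br /> also arrives here, because the default handle_startendtag
        # forwards to handle_starttag.
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <a> and <b> is kept; only SKIP tags are ignored.
        if self._skip_depth == 0:
            self._buf.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace, including newlines.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.lines.append(text)


def lines_before_br(html):
    """Return the list of text lines that appear before each <br>."""
    parser = LinesBeforeBr()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Unlike the lxml version, this collapses the embedded newline in the second line to a single space, and any text after the last <br> is dropped because no further <br> triggers a flush.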
