BeautifulSoup - combine consecutive tags

Question

I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:

<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>

That's kind of hard to read, but basically the word "INTRODUCTION" is split into

<b><span>I</span></b> 

<b><span>NTRODUCTION</span></b>

with both the span and b tags carrying the same inline properties.

What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.

from bs4 import Tag

for b in soup.find_all('b'):
    nxt = b.next_sibling  # None at the end of a parent, or a text node
    if isinstance(nxt, Tag) and nxt.name == 'b':
        pass  # combine them here??

Any ideas?

The expected output is as follows:

<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>

Recommended answer

Perhaps you could check whether b.previous_sibling (spelled previousSibling in older BeautifulSoup versions) is a b tag, then append the inner text of the current node to it. After doing this, you should be able to remove the current node from the tree with b.decompose().
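
The suggestion above could be sketched roughly as follows. This is only one possible reading of it, not a tested solution for the real document: the helper name merge_consecutive_b is hypothetical, the sketch assumes the b tags are directly adjacent (no whitespace text node between them) and are not nested inside one another, and it keeps the attributes of the first tag in each run rather than the second.

```python
from bs4 import BeautifulSoup, Tag


def merge_consecutive_b(soup):
    """Fold each <b> into the <b> immediately before it (hypothetical helper)."""
    # find_all returns a list built up front, so removing tags while looping is fine.
    for b in soup.find_all("b"):
        prev = b.previous_sibling
        # previous_sibling may be None or a text node; only merge into a real <b>.
        if not (isinstance(prev, Tag) and prev.name == "b"):
            continue
        # Put the text inside the previous tag's <span> if it has one,
        # otherwise directly inside the previous <b>.
        span = prev.find("span")
        target = span if span is not None else prev
        target.append(b.get_text())
        # Remove the now-redundant tag from the tree.
        b.decompose()
    return soup


html = "<b><span>I</span></b><b><span>NTRODUCTION</span></b>"
soup = BeautifulSoup(html, "html.parser")
merge_consecutive_b(soup)
print(soup)  # <b><span>INTRODUCTION</span></b>
```

Because each removed tag drops out of the sibling chain, a run of three or more b tags collapses into the first one. If the surviving tag should carry the second tag's styles, as in the expected output above, the attributes would need to be copied over before calling decompose().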
