Merge multiple <br /> tags to a single one with python lxml

Views: 121 | Published: 2020/5/4 8:31:38 | Tags: python, lxml

Problem description

I have a Python script that cleans scraped HTML content; it uses BeautifulSoup4 and works pretty well. Recently I decided to learn lxml, but I find the tutorials harder (for me) to follow. For example, I use the following code to merge multiple <br /> tags into one, i.e., if there is more than one consecutive <br /> tag, remove all but one:

from bs4 import BeautifulSoup, Tag

data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'
# Parser named explicitly (assumed from the html/body wrapper in the output below).
soup = BeautifulSoup(data, "lxml")
for br in soup.find_all("br"):
    # Remove every <br> tag that directly follows this one.
    while isinstance(br.next_sibling, Tag) and br.next_sibling.name == 'br':
        br.next_sibling.extract()
print(soup)
# -> <html><body><p>foo<br/>bar. </p><p>foo<br/>bar</p></body></html>

How do I achieve something similar in lxml? Thanks.

Recommended answer

You could try the .drop_tag() method to remove duplicate consecutive occurrences of the <br /> tag:

from lxml import html

data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'  # same input as in the question
doc = html.fromstring(data)
for br in doc.findall('.//br'):
    if br.tail is None:  # no text immediately after <br> tag
        for dup in br.itersiblings():
            if dup.tag != 'br':  # don't merge if there is another tag in between
                break
            dup.drop_tag()  # removes the tag; its tail text stays in the tree
            if dup.tail is not None:  # don't merge if there is text in between
                break

print(html.tostring(doc, encoding='unicode'))  # str rather than bytes on Python 3
# -> <div><p>foo<br>bar. </p><p>foo<br>bar</p></div>
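The check `br.tail is None` treats any text after a <br> tag as real content, including a lone newline, so tags that a scraper left on separate lines would not be merged. Below is a minimal sketch of a variant that ignores whitespace-only text between tags. The helper name `collapse_br_runs` and the whitespace rule are assumptions made for illustration; they are not part of the original answer.

```python
from lxml import html


def collapse_br_runs(markup):
    """Collapse each run of consecutive <br> tags into a single <br>.

    Hypothetical helper: whitespace-only text between two <br> tags is
    treated as "no text" (an assumption, not the original answer's rule).
    """
    doc = html.fromstring(markup)
    for br in doc.findall('.//br'):
        if br.getparent() is None:
            continue  # this <br> was already dropped as a duplicate
        # Keep dropping the next sibling while only whitespace separates
        # it from the current <br>.
        while not (br.tail or '').strip():
            nxt = br.getnext()
            if nxt is None or nxt.tag != 'br':
                break  # end of parent, or another tag in between
            # drop_tag() removes the tag and appends its tail text to the
            # previous sibling's tail, i.e. to br.tail.
            nxt.drop_tag()
    return html.tostring(doc, encoding='unicode')


sample = '<div>foo<br><br>bar<p>a<br>\n<br id="1"><br>b</p></div>'
print(collapse_br_runs(sample))
```

Because drop_tag() moves the dropped tag's tail onto the surviving <br>, re-checking `br.tail` on every pass is enough to stop the loop as soon as real text appears.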
