Split an element with BeautifulSoup

Question

I have some html code that I'm parsing with BeautifulSoup. One of the requirements is that tags are not nested in paragraphs or other text tags.

For example, if I have code like this:

<p>
    first text
    <a href="...">
        <img .../>
    </a>
    second text
</p>

I need to transform it into something like this:

<p>first text</p>
<img .../>
<p>second text</p>

I have done something to extract the images and add them after the paragraph, like this:

# soup is the BeautifulSoup object for the parsed document
for match in soup.body.find_all(True, recursive=False):
    # find_all returns a list, so the tree can be modified while looping
    for img in match.find_all('img'):
        # has_attr checks HTML attributes; hasattr would check Python attributes
        if img.has_attr('src'):
            # add image as an independent tag
            tag = soup.new_tag("img")
            tag['src'] = img['src']
            # fall back to an empty alt text when none is given
            tag['alt'] = img.get('alt', '')

            match.insert_after(tag)

        # remove image from its container
        img.extract()

I have written another piece of code that deletes empty elements (like the tag that is left empty after its image is removed), but I have no idea how to split the element into two different ones.

Recommended answer

# split is a built-in string method, so no import is needed
parts = the_string.split(the_separator, the_limit)

This produces a list, so you can either go through it with a for loop or pick out elements by index.

  • the_limit is optional
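
As a small illustration of the call above, here is a sketch using Python's built-in str.split. The sample text and the variable names (text, parts, head, rest) are made up for this example and do not come from the question.

```python
# Hypothetical sample text, standing in for the contents of a paragraph.
text = "first text\nsecond text\nthird text"

# Without a limit, the string is split at every separator.
parts = text.split("\n")

# With a limit of 1, only the first separator is used and the
# remainder of the string stays in the last element.
head, rest = text.split("\n", 1)

# The result is a list, so it can be looped over or indexed.
for part in parts:
    print(part.strip())
```

Note that this works on plain strings only. Splitting the text this way does not by itself tell you where the nested image tag sat in the original markup.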

In your case I think the_separator needs to be "\n", but that depends on the case. Parsing is very interesting, yet sometimes difficult to do.
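
Splitting on "\n" depends on how the markup happens to be laid out, so an alternative is to work on the parse tree instead. The following is only a rough sketch of that idea, not the answer's method: split_paragraph is a hypothetical helper name, and it assumes that inline wrappers such as the link can be discarded and that each text run should become its own paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def split_paragraph(soup, p):
    """Replace p with <p> tags for its text runs and bare <img> tags.

    Assumption for this sketch: inline markup inside p (links, bold,
    and so on) is flattened away and only text and images are kept.
    """
    pieces = []  # new top-level nodes, in document order
    buffer = []  # text collected since the last image

    def flush():
        # Collapse whitespace and emit a paragraph if any text remains.
        text = " ".join(" ".join(buffer).split())
        buffer.clear()
        if text:
            new_p = soup.new_tag("p")
            new_p.string = text
            pieces.append(new_p)

    # Take a snapshot of the descendants, because images are
    # extracted from the tree while looping.
    for node in list(p.descendants):
        if isinstance(node, NavigableString):
            buffer.append(str(node))
        elif node.name == "img":
            flush()
            pieces.append(node.extract())
    flush()

    # Put the new nodes where the old paragraph was, then drop it.
    for piece in pieces:
        p.insert_before(piece)
    p.decompose()


# Hypothetical input modelled on the example in the question.
html = (
    '<body><p>\n    first text\n    '
    '<a href="x"><img src="a.png"/></a>'
    '\n    second text\n</p></body>'
)
soup = BeautifulSoup(html, "html.parser")
split_paragraph(soup, soup.p)
result = str(soup.body)
print(result)
```

Because the text before and after each image is flushed into a separate paragraph, no empty container is left behind and a separate cleanup pass for empty tags should not be needed for this case.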
