How to remove whitespace in BeautifulSoup


Problem description

    I have a bunch of HTML I'm parsing with BeautifulSoup and it's been going pretty well except for one minor snag. I want to save the output into a single-lined string, with the following as my current output:

        <li><span class="plaincharacterwrap break">
                        Zazzafooky but one two three!
                    </span></li>
    <li><span class="plaincharacterwrap break">
                        Zazzafooky2
                    </span></li>
    <li><span class="plaincharacterwrap break">
                        Zazzafooky3
                    </span></li>
    

    Ideally I'd like

    <li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>
    

    There's a lot of redundant whitespace that I'd like to get rid of but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

    I don't have any <pre> tags so I can be a little more forceful there.

    Thanks once again!

    Solution

    Here is how you can do it without regular expressions:

    >>> html = """    <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky but one two three!
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky2
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky3
    ...                 </span></li>
    ... """
    >>> html = "".join(line.strip() for line in html.split("\n"))
    >>> html
    '<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
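
The same idea can be wrapped in a small reusable function. This is only a sketch using the standard library: the names `strip_line_whitespace` and `sample` are made up for illustration, and it uses `str.splitlines()` in place of `split("\n")` so that `\r\n` line endings are handled as well.

```python
def strip_line_whitespace(html: str) -> str:
    """Strip leading/trailing whitespace from every line, then join the lines.

    Hypothetical helper: it treats the markup as plain text, so it assumes
    there are no <pre> tags (or other whitespace-sensitive content) to protect.
    """
    return "".join(line.strip() for line in html.splitlines())


# Shortened version of the markup from the question.
sample = """    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
"""

result = strip_line_whitespace(sample)
print(result)
```

Spaces inside a line, such as those in "one two three", are kept because only the ends of each line are stripped. One caveat: lines are joined with no separator, so text that wraps across several lines would lose the space at the break ("one two" followed by "three" becomes "one twothree"). If the input can contain wrapped text, that case needs separate handling.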
    
