Stop BeautifulSoup from removing whitespace

Problem description

BeautifulSoup is removing whitespace that comes before a newline between tags:

print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>")

The code above prints:

<?xml version="1.0" encoding="utf-8"?>
<section>
</section>

Notice that the four spaces after the opening section tag are missing! Interestingly, if I do:

print BeautifulSoup("<?xml version='1.0' encoding='UTF-8'?><section>a    \n</section>")

I get:

<?xml version="1.0" encoding="utf-8"?>
<section>a    
</section>

The four spaces after 'a' are now present! How can I get the four spaces to show up in the original print statement?

Recommended answer

As a workaround, you could replace every <section>...</section> with <pre>...</pre> before parsing. BeautifulSoup then fully preserves the spaces inside the element. For example:

from bs4 import BeautifulSoup
import re

html = "<?xml version='1.0' encoding='UTF-8'?><section>    \n</section>"
html = re.sub(r'(\</?)(section)(\>)', r'\1pre\3', html)
soup = BeautifulSoup(html, "lxml")

print repr(soup.pre.text)    # repr used to show where the spaces are

This gives you:

u'    \n'
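
The regular expression in the answer only matches bare <section> and </section> tags, so a tag carrying attributes would be left alone. The following is a minimal sketch of the renaming step on its own, using only the standard re module. The helper name rename_tag and the id='s1' attribute are made up for illustration and are not part of the original answer. It assumes attribute values never contain a ">" character.

```python
import re

def rename_tag(markup, old, new):
    """Rename <old ...> and </old> tags to <new ...> and </new>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # (</?) captures "<" or "</".
    # (?=[\s/>]) requires the tag name to end here, so "sections" is not matched.
    # ([^>]*>) keeps any attributes and the closing ">" unchanged.
    pattern = re.compile(r'(</?)' + re.escape(old) + r'(?=[\s/>])([^>]*>)')
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), markup)

xml = "<?xml version='1.0' encoding='UTF-8'?><section id='s1'>    \n</section>"

# Step 1: swap section for pre before handing the markup to the parser.
as_pre = rename_tag(xml, "section", "pre")
print(repr(as_pre))

# Step 2: the same helper can swap the names back in serialized output.
restored = rename_tag(as_pre, "pre", "section")
print(restored == xml)
```

Swapping the names back this way also renames any genuine <pre> elements, so it is only safe when the input contains none.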
