BeautifulSoup: do not add spaces where they matter, remove them where they don't
Question
This sample Python program:
document='''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())
produces the following output:
<html>
<body>
<p>
This is
<i>
something
</i>
, it happens
in
<b>
real
</b>
life
</p>
</body>
</html>
That's wrong, because it adds whitespace before and after each opening and closing tag. For example, there should be no space between </i> and the comma that follows it. I would like it to:
Not add whitespace where there is none (even around block-level tags, where added whitespace could be problematic if the tags are styled with display:inline in CSS).
Collapse all runs of whitespace into a single space, except optionally for line wrapping.
Like this:
<html>
<body>
<p>This is
<i>something</i>,
it happens in
<b>real</b> life</p>
</body>
</html>
Is this possible with BeautifulSoup? Is there any other recommended HTML parser that can deal with this?
Recommended answer
Because .prettify puts each tag on its own line, it is not suitable for production code; it is only usable for debugging output, IMO. Just convert your soup to a string using the str builtin function.
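As a minimal sketch of that suggestion, the snippet below serializes the question's document with str instead of .prettify. The explicit "html.parser" argument is an assumption added here so that the example does not depend on lxml; with that parser the fragment is not wrapped in <html> and <body> as in the question's output.

```python
from bs4 import BeautifulSoup

document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

# Assumption: use the parser backed by the standard library. The question
# passes no parser argument, which lets bs4 pick whichever is installed.
soup = BeautifulSoup(document, "html.parser")

# str() serializes the tree without inserting a line break around every tag,
# so the comma stays directly after </i>.
serialized = str(soup)
print(serialized)
```

Note that str only avoids adding whitespace; the line break that is already inside the source string is still present in the result.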
What you want is a change of the string contents in your tree; you could create a function to find all elements which contain sequences of two or more whitespace characters (using a pre-compiled regular expression), and then replace their contents.
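One way such a function might look is sketched below. The name collapse_whitespace, the constant names, and the list of tags to skip are hypothetical choices for this illustration, not part of BeautifulSoup. The sketch also departs slightly from the description above: it matches runs of one or more whitespace characters, so that a lone line break is turned into a space as well.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Pre-compiled pattern. Assumption: match one or more whitespace characters
# (not only two or more) so a single newline also becomes a space.
WHITESPACE_RUN = re.compile(r"\s+")

# Hypothetical list of tags whose whitespace is left untouched.
PRESERVE_TAGS = ["pre", "textarea", "script", "style"]


def collapse_whitespace(soup):
    """Collapse whitespace runs in ordinary text nodes, modifying soup in place."""
    # Copy the result into a list first so the tree is not mutated mid-iteration.
    for node in list(soup.find_all(string=True)):
        # Skip comments, doctypes and other special string subclasses.
        if type(node) is not NavigableString:
            continue
        if node.find_parent(PRESERVE_TAGS) is not None:
            continue
        collapsed = WHITESPACE_RUN.sub(" ", node)
        if collapsed != node:
            node.replace_with(collapsed)
    return soup


document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

soup = collapse_whitespace(BeautifulSoup(document, "html.parser"))
result = str(soup)
print(result)

# Whitespace inside <pre> is kept as written.
pre_result = str(collapse_whitespace(BeautifulSoup("<pre>a   b</pre>", "html.parser")))
```

The sketch only rewrites text nodes and never inserts new ones, so no space can appear between </i> and the comma. It does not implement the optional line wrapping mentioned in the question.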
BTW, you can avoid putting insignificant whitespace into the Python string in the first place if you write your example like so:
document = ('<p>This is <i>something</i>, it happens '
'in <b>real</b> life</p>')
This way you have two literals which are implicitly concatenated.
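To illustrate the difference, the short sketch below compares the triple-quoted string from the question with the implicitly concatenated version. The variable names are chosen here for illustration only.

```python
# Triple-quoted literal from the question: the line break is part of the string.
multiline = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

# Adjacent literals inside parentheses are joined at compile time, so the
# only whitespace between "happens" and "in" is the explicit trailing space.
joined = ('<p>This is <i>something</i>, it happens '
          'in <b>real</b> life</p>')

print("\n" in multiline)
print("\n" in joined)
print(multiline.replace("\n", " ") == joined)
```

The first check finds a newline in the triple-quoted string, the second finds none in the concatenated one, and the third confirms that the two strings differ only in that one character.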