BeautifulSoup: do not add spaces where they matter, remove them where they don't
Question
This sample Python program:
document='''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())
produces the following output:
<html>
<body>
<p>
This is
<i>
something
</i>
, it happens
in
<b>
real
</b>
life
</p>
</body>
</html>
That's wrong, because it adds whitespace before and after each opening and closing tag. For example, there should be no space between </i> and the comma that follows it. I would like it to:
- Not add whitespace where there is none (even around block-level tags, where it could be problematic if they are styled with display:inline in CSS).
- Collapse all whitespace into a single space, except optionally for line wrapping.
Something like this:
<html>
<body>
<p>This is
<i>something</i>,
it happens in
<b>real</b> life</p>
</body>
</html>
Is this possible with BeautifulSoup? Is there any other recommended HTML parser that can deal with this?
Recommended answer
Because .prettify habitually puts each tag on its own line, it is not suitable for production code; it is only usable for debugging output, IMO. Just convert your soup to a string using the str builtin function.
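As a minimal sketch of that suggestion: the snippet below serializes a soup with str instead of .prettify. The choice of "html.parser" is an assumption made here to keep the example dependency-free; the output shown in the question, with its added html and body tags, appears to come from a different parser.

```python
from bs4 import BeautifulSoup

document = '<p>This is <i>something</i>, it happens in <b>real</b> life</p>'

# Assumption: "html.parser" (backed by the standard library) is used here;
# other parsers such as lxml wrap the fragment in <html><body>.
soup = BeautifulSoup(document, "html.parser")

# str() serializes the tree without inserting a line break around each tag.
serialized = str(soup)
print(serialized)
```

Unlike .prettify, this leaves the text next to inline tags untouched, so no space appears between </i> and the comma.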
What you want is a change of the string contents in your tree. You could create a function that finds all elements containing sequences of two or more whitespace characters (using a pre-compiled regular expression) and then replaces their contents.
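One way such a function might look is sketched below. The name collapse_whitespace and the list of tags to skip are hypothetical choices for illustration. The sketch also deviates slightly from the description above: it matches any run of one or more whitespace characters rather than two or more, so that the single newline in the question's example is turned into a space as well.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Pre-compiled pattern: any run of whitespace (spaces, tabs, newlines).
WHITESPACE_RUN = re.compile(r"\s+")

# Hypothetical list of tags whose whitespace is significant and left alone.
PRESERVED_TAGS = ["pre", "textarea"]


def collapse_whitespace(soup):
    """Collapse whitespace runs to one space in plain text nodes, in place."""
    # find_all returns a list, so replacing nodes while looping is safe.
    for text_node in soup.find_all(string=True):
        # Skip comments, CDATA and similar NavigableString subclasses.
        if type(text_node) is not NavigableString:
            continue
        if text_node.find_parent(PRESERVED_TAGS) is not None:
            continue
        collapsed = WHITESPACE_RUN.sub(" ", text_node)
        if collapsed != text_node:
            text_node.replace_with(collapsed)
    return soup


document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

# Assumption: "html.parser" is used so no <html>/<body> wrapper is added.
soup = collapse_whitespace(BeautifulSoup(document, "html.parser"))
result = str(soup)
print(result)
# <p>This is <i>something</i>, it happens in <b>real</b> life</p>
```

Only existing whitespace is rewritten, so no space is introduced where the source had none.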
BTW, you can have Python avoid the insertion of insignificant whitespace if you write your example like so:
document = ('<p>This is <i>something</i>, it happens '
'in <b>real</b> life</p>')
This way you have two literals which are implicitly concatenated.
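A quick check of the difference, using the two spellings of the document from this page (the variable names multi_line and concatenated are only illustrative):

```python
# Triple-quoted literal: the line break in the source becomes part of the string.
multi_line = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''

# Adjacent literals inside parentheses are joined at compile time,
# so the only separator is the explicit trailing space.
concatenated = ('<p>This is <i>something</i>, it happens '
                'in <b>real</b> life</p>')

print("\n" in multi_line)    # True
print("\n" in concatenated)  # False
```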