BeautifulSoup .children or .contents without whitespace between tags


Question

I want all children of a tag without the whitespace between the tags. But BeautifulSoup's .contents and .children also return the whitespace between the tags.

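from bs4 import BeautifulSoup

# Each child sits on its own line with no indentation, so the only
# text between two tags is a single newline.
html = """
<div id="list">
<span>1</span>
<a href="2.html">2</a>
<a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').contents)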

This prints:

['\n', <span>1</span>, '\n', <a href="2.html">2</a>, '\n', <a href="3.html">3</a>, '\n']

The same is printed by:

print(list(soup.find(id='list').children))

What I want:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]

Is there any way to tell BeautifulSoup to return only the tags and ignore the whitespace?

The documentation is not very helpful on this topic. The HTML in its examples does not contain any whitespace between tags.

Indeed, stripping all whitespace between tags from the HTML solves my problem:

html = """<div id="list"><span>1</span><a href="2.html">2</a><a href="3.html">3</a></div>"""

With this HTML I get the tags without whitespace between them, because there is none in the source. But I hoped to use BeautifulSoup so that I do not have to mess around in the HTML source code. I was hoping BeautifulSoup would do that for me.

Another workaround might be:

print(list(filter(lambda t: t != '\n', soup.find(id='list').contents)))

But that seems flaky. Is the whitespace guaranteed to always be exactly '\n'?

Note for the duplicate-flagging brigade:

There are many questions about BeautifulSoup and whitespace. Most ask about getting rid of whitespace in the "rendered text".

For example:

BeautifulSoup - getting rid of paragraph whitespace/line breaks

From the output of python BeautifulSoup

Both questions want the text without whitespace. I want the tags without whitespace. The solutions there don't apply to my question.

Another example:

Regular expression for class with whitespace using Beautifulsoup

That question is about whitespace in the class attribute.

Recommended answer

BeautifulSoup has .find_all(True), which returns all tags without the whitespace between the tags:

from bs4 import BeautifulSoup
html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True))

Prints:

[<span>1</span>, <a href="2.html">2</a>, <a href="3.html">3</a>]
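Why this works becomes clearer when looking at node types. In the sketch below (the variable names are illustrative and not from the original answer), the whitespace between tags shows up in .contents as NavigableString objects, while find_all(True) matches only Tag objects.

```python
from bs4 import BeautifulSoup

html = """
<div id="list">
  <span>1</span>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</div>
"""
container = BeautifulSoup(html, "html.parser").find(id="list")

# .contents mixes two node types: strings (the whitespace) and tags.
content_types = [type(node).__name__ for node in container.contents]
print(content_types)

# Passing True as the name filter matches every tag and never a string.
tag_names = [tag.name for tag in container.find_all(True)]
print(tag_names)
```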

Combine it with recursive=False and you get only the direct children, not children of children.

To demonstrate, I added <b> to the second child. That would be a grandchild.

from bs4 import BeautifulSoup
html = """
<div id="list">
  <span>1</span>
  <a href="2.html"><b>2</b></a>
  <a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='list').find_all(True, recursive=False))

With recursive=False it prints:

[<span>1</span>, <a href="2.html"><b>2</b></a>, <a href="3.html">3</a>]

With recursive=True it prints:

[<span>1</span>, <a href="2.html"><b>2</b></a>, <b>2</b>, <a href="3.html">3</a>]
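Another way to get only the direct child tags is to filter .children by type. This is a minimal sketch rather than part of the original answer; the helper name direct_child_tags is hypothetical. For this sample it should give the same list as find_all(True, recursive=False).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def direct_child_tags(parent):
    """Return the direct children of `parent` that are tags, skipping strings."""
    return [child for child in parent.children if isinstance(child, Tag)]


html = """
<div id="list">
  <span>1</span>
  <a href="2.html"><b>2</b></a>
  <a href="3.html">3</a>
</div>
"""
container = BeautifulSoup(html, "html.parser").find(id="list")

children = direct_child_tags(container)
print(children)

# Compare with the approach from the answer; both skip the grandchild <b>.
same_as_find_all = children == list(container.find_all(True, recursive=False))
print(same_as_find_all)
```

The type check makes the intent explicit, whereas find_all(True, recursive=False) is shorter; which one reads better is a matter of taste.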


Trivia: now that I have the solution, I found another, seemingly unrelated question and answer on StackOverflow where the solution was hidden in a comment:

Why does BeautifulSoup .children contain nameless elements as well as the expected tags
