Get all HTML tags with Beautiful Soup


Question

I am trying to get a list of all HTML tags from Beautiful Soup.

I see find_all(), but I have to know the name of the tag before I search.

If there is text like

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but I am trying to learn BS4.

Solution

You don't have to specify any arguments to find_all(). Called without arguments, it returns every tag in the tree, recursively. Sample:

>>> from bs4 import BeautifulSoup
>>>
>>> html = """<div>something</div>
... <div>something else</div>
... <div class='magical'>hi there</div>
... <p>ok</p>"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> [tag.name for tag in soup.find_all()]
['div', 'div', 'div', 'p']
>>> [str(tag) for tag in soup.find_all()]
['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
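
Neither list above matches the opening-tag-only form the question asks for. One way to get closer, as a rough sketch rather than a built-in Beautiful Soup feature, is to rebuild each opening tag from tag.name and tag.attrs. The helper opening_tag below is a hypothetical name introduced only for illustration. It assumes attribute values are strings or lists of strings (multi-valued attributes such as class are parsed into lists), and it always emits double quotes, so the original single quotes are not preserved.

```python
from html import escape

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""


def opening_tag(tag):
    """Rebuild an opening tag string from a bs4 Tag (hypothetical helper)."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append('{}="{}"'.format(key, escape(value, quote=True)))
    return "<" + " ".join(parts) + ">"


soup = BeautifulSoup(html, "html.parser")
# find_all() with no arguments walks the whole tree, nested tags included.
list_of_tags = [opening_tag(tag) for tag in soup.find_all()]
print(list_of_tags)
# ['<div>', '<div>', '<div class="magical">', '<p>']
```

Because the strings are reconstructed from the parsed tree, they reflect Beautiful Soup's normalized view of the markup, not the exact source text.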
