Beautiful Soup Using Regex to Find Tags?
I'd really like Beautiful Soup to match any tag from a list of tags, like so. I know attr accepts regex, but is there anything in Beautiful Soup that lets you do the same for tag names?
soup.findAll("(a|div)")
Output:
<a> ASDFS
<div> asdfasdf
<a> asdfsdf
My goal is to create a scraper that can grab tables from sites. Sometimes tags are named inconsistently, and I'd like to be able to input a list of tags to name the 'data' part of a table.
find_all() is the most popular method in the Beautiful Soup search API.
You can pass it a variety of filters. To find multiple tags, pass a list:
>>> soup.find_all(['a', 'div'])
Example:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div>asdfasdf</div><p><a>foo</a></p></body></html>')
>>> soup.find_all(['a', 'div'])
[<div>asdfasdf</div>, <a>foo</a>]
Or you can use a regular expression to find tags whose names contain a or div:
>>> import re
>>> soup.find_all(re.compile("(a|div)"))
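Note that Beautiful Soup applies a compiled pattern to tag names with its search() method, so an unanchored pattern such as (a|div) also matches names that merely contain "a", like table. The sketch below contrasts the unanchored pattern with an anchored one; the sample markup (with an added table) and the variable names loose and exact are made up for illustration, and the stdlib "html.parser" backend is assumed so that no extra tags are inserted.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: the answer's example plus a small table.
html = (
    "<html><body><div>asdfasdf</div><p><a>foo</a></p>"
    "<table><tr><td>x</td></tr></table></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag whose name contains "a" or "div".
loose = [tag.name for tag in soup.find_all(re.compile("(a|div)"))]
print(loose)  # ['div', 'a', 'table']

# Anchored: matches only tags named exactly "a" or "div".
exact = [tag.name for tag in soup.find_all(re.compile(r"^(a|div)$"))]
print(exact)  # ['div', 'a']
```

If exact tag names are all you need, the list form soup.find_all(['a', 'div']) is simpler; the regex form is mainly useful when the names follow a pattern.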