Need to find text with RegEx and BeautifulSoup
Problem description
I'm trying to parse a website to pull out some data that is stored in the body such as this:
<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>
I would like to use BeautifulSoup4 and RegEx to pull out the values for Hookups, Group Sites, and so on, but I am new to both bs4 and RegEx. I have tried the following to get the Hookups value:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('doc.html'))
hookups = soup.find_all(re.compile("Hookups:(.*)Group"))
But the search returns nothing.
Recommended answer
Called with a pattern as its first argument, BeautifulSoup's find_all matches tag names, not the text between tags. If the HTML really is this simple, you can get what you need with a plain regex. Otherwise, you can use find_all and then read the .text of the nodes it returns.
re.findall("Hookups: (.*)", open('doc.html').read())
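The same regex-only idea can be extended to collect every field at once. The following is a minimal sketch, assuming each value sits on its own line as "Label: value" exactly as in the question's sample; the names html, pairs, and fields are illustrative, and the markup is inlined here instead of being read from doc.html.

```python
import re

# Sample markup copied from the question; a real script would read doc.html instead.
html = """<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>"""

# With re.MULTILINE, ^ and $ anchor at each line, so every "Label: value"
# line yields one (label, value) pair. Labels may contain spaces.
pairs = re.findall(r"^([A-Za-z ]+):[ \t]*(.*)$", html, flags=re.MULTILINE)

# Collect the pairs into a dictionary keyed by label.
fields = {label.strip(): value.strip() for label, value in pairs}
print(fields)
```

This only holds up while the page keeps one field per line; markup that puts several fields on one line or wraps values in tags would need a real parser.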
As of BeautifulSoup 4.2, you can also search by tag content with the text argument:
soup.find_all(text=re.compile("Hookups:(.*)Group"))
Since BeautifulSoup 4.4, the text argument is named string.
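Searching by text content returns the matching text nodes themselves, not the regex capture groups, so a second regex pass is needed to isolate the value. Here is a rough sketch of that two-step approach, assuming BeautifulSoup 4.4 or later (for the string argument), the built-in html.parser, and the sample markup from the question; the pattern is simplified to a single line because "." does not match newlines by default.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question, inlined for illustration.
html = """<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>"""

soup = BeautifulSoup(html, "html.parser")

# string= filters text nodes; each result is a whole text node
# that contains "Hookups:", not just the captured value.
nodes = soup.find_all(string=re.compile(r"Hookups:"))

# Apply the regex again to each node to extract the value after the label.
hookups = [re.search(r"Hookups:[ \t]*(.*)", node).group(1).strip() for node in nodes]
print(hookups)
```

On older 4.x releases the same call would be written with text= in place of string=.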