Does BeautifulSoup's .select() method support regular expressions?

Question

Suppose I want to parse an HTML document with BeautifulSoup and use CSS selectors to find specific tags. I would "soupify" it like this:

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html, 'html.parser')

If I wanted to find the tag whose "id" attribute has the value "abc", I could do:

soup.select('#abc')

If I wanted to find all "a" tags nested under that tag, I could do:

soup.select('#abc a')
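
One detail worth spelling out here: a space in a selector matches descendants at any depth, while ">" restricts the match to direct children. The following is a minimal sketch of that difference; the markup is made up for illustration and is not from the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link directly under #abc, one nested in a <p>.
html = '<div id="abc"><a href="/one">one</a><p><a href="/two">two</a></p></div>'
soup = BeautifulSoup(html, "html.parser")

# A space is the descendant combinator: it matches links at any depth.
descendants = [a["href"] for a in soup.select("#abc a")]

# ">" is the child combinator: it matches direct children only.
children = [a["href"] for a in soup.select("#abc > a")]

print(descendants)  # ['/one', '/two']
print(children)     # ['/one']
```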

But now suppose I want to find all "a" tags whose "href" attribute has a value ending in "xyz". I would want to use a regex for that, and I was hoping for something along the lines of:

soup.select('#abc a[href] = re.compile(r"xyz$")')

I cannot find anything that says BeautifulSoup's .select() method supports regex.

Recommended answer

The soup.select() method only supports CSS selector syntax; regular expressions are not part of that.

CSS does, however, have an attribute selector that matches values ending with a given text:

soup.select('#abc a[href$="xyz"]')
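
To see the suffix selector in action, here is a small self-contained sketch. The sample markup and its file names are invented for this illustration; only the selector itself comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = """
<div id="abc">
  <a href="/files/report.xyz">report</a>
  <a href="/files/notes.txt">notes</a>
  <a href="/xyz/index.html">index</a>
</div>
<a href="/outside.xyz">outside</a>
"""

soup = BeautifulSoup(html, "html.parser")

# [href$="xyz"] keeps only links whose href ends with the literal text "xyz".
# Links outside #abc, or with "xyz" elsewhere in the href, are skipped.
matches = soup.select('#abc a[href$="xyz"]')
hrefs = [a["href"] for a in matches]

print(hrefs)  # ['/files/report.xyz']
```

The related operators ^= (value starts with) and *= (value contains) follow the same pattern, which covers many cases where a regex would otherwise seem necessary.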

See the CSS attribute selectors documentation on MSDN.

You can always take the results of a CSS selector and continue the search with the regular Beautiful Soup API, which does accept regular expressions:

import re
for element in soup.select('#abc'):
    # Raw string with escaped dots, so "\d" and "." mean what they should.
    child_elements = element.find_all(href=re.compile(r'^http://example\.com/\d+\.html'))
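
A runnable version of this two-step approach might look like the sketch below. The markup and URLs are invented for the example, and restricting find_all to "a" tags is an assumption added here, not part of the original snippet.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = """
<div id="abc">
  <a href="http://example.com/123.html">numbered page</a>
  <a href="http://example.com/about.html">about</a>
  <a href="http://example.org/456.html">other site</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"^http://example\.com/\d+\.html")

found = []
for element in soup.select("#abc"):
    # Step 1 used a CSS selector; step 2 filters with a regex,
    # which find_all accepts as an attribute value.
    found.extend(element.find_all("a", href=pattern))

links = [a["href"] for a in found]
print(links)  # ['http://example.com/123.html']
```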

Note that, as the element.select() documentation states:

This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it's a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.

Emphasis mine.
