Regular expression for class with whitespaces using BeautifulSoup
Question
I found that the BeautifulSoup.find() method splits the class attribute on whitespace. Because of that, I could not use a regular expression, as shown in the code below. Could somebody help me find the right way to get all 'tree children' elements?
import re
from bs4 import BeautifulSoup

r_html = "<div class='root'>" \
         "<div class='tree children1'>text children 1 </div>" \
         "<div class='tree children2'>text children 2 </div>" \
         "<div class='tree children3'>text children 3 </div>" \
         "</div>"
bs_tab = BeautifulSoup(r_html, "html.parser")

# Searching for the exact class string works
workspace_box_visible = bs_tab.find_all('div', {'class': 'tree children1'})
print(workspace_box_visible)  # result: [<div class="tree children1">text children 1 </div>]

# Searching with a regex that spans both class names does not
workspace_box_visible = bs_tab.find_all('div', {'class': re.compile(r'^tree children\d')})
print(workspace_box_visible)  # result: [] >>>> empty list because
# the class name was split on the whitespace character <<<<

# >>>>>> print all element classes <<<<<<<
def print_class(class_):
    print(class_)
    return False

workspace_box_visible = bs_tab.find('div', {'class': print_class})
# expected:
# root
# tree children1
# tree children2
# tree children3
# actual:
# root
# tree
# children1
# tree
# children2
# tree
# children3
Thanks in advance.
==== comments ==========
Stack Overflow does not allow comments longer than 500 characters, so I am adding my comments here.
The example above was meant to show how BeautifulSoup looks up the requested classes.
But if I have a DOM structure like this:
r_html = "<div class='root'>" \
"<div class='tree children'>zero</div>" \
"<div class='tree children first'>first</div>" \
"<div class='tree children second'>second</div>" \
"<div class='tree children third'>third</div>" \
"</div>"
and I need to select the controls whose class attributes are 'tree children' and 'tree children first', none of the methods described in your (Padraic Cunningham's) post work.
I found a solution using a regex:
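controls = bs_tab.find_all('div')
for control in controls:
    # control.get('class', []) is the list of class names, or [] when the tag has no class;
    # the $ anchors keep 'tree children second' and 'tree children third' from matching
    if re.search(r"^tree children$|^tree children first$", " ".join(control.get('class', []))):
        print(control)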
And another solution:
bs_tab.find_all('div', class_='tree children') + bs_tab.find_all('div', class_='tree children first')
I know it's not a good solution, and I hope the BeautifulSoup module has an appropriate method for this.
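One way to express this selection as a single search is to pass find_all() a function that receives each tag and compares its space-joined class list against the wanted values. This is only a sketch: the helper name has_exact_class and the wanted set are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

r_html = (
    "<div class='root'>"
    "<div class='tree children'>zero</div>"
    "<div class='tree children first'>first</div>"
    "<div class='tree children second'>second</div>"
    "<div class='tree children third'>third</div>"
    "</div>"
)
bs_tab = BeautifulSoup(r_html, "html.parser")

# Hypothetical set of full class strings that should match exactly.
wanted = {"tree children", "tree children first"}


def has_exact_class(tag):
    """Return True for <div> tags whose whole class attribute is in `wanted`."""
    # tag.get("class", []) is a list of class names, or [] if there is no class.
    return tag.name == "div" and " ".join(tag.get("class", [])) in wanted


controls = bs_tab.find_all(has_exact_class)
texts = [control.get_text() for control in controls]
print(texts)
```

Because the function sees the whole tag, the comparison does not depend on how the library splits the class attribute during matching.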
Answer
There are a few different ways, depending on the structure of the HTML. These are CSS classes, so you could simply use class_=..., or a CSS selector with .select():
In [3]: bs_tab.find_all('div', class_="tree")
Out[3]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
In [4]: bs_tab.select("div.tree")
Out[4]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
But if there were another element with the tree class elsewhere, it would be found as well.
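If the stray tree elements live outside the container, one option is to scope the selector to direct children of div.root with the CSS child combinator. The following is a minimal sketch, assuming the wanted divs are always direct children of div.root; the sidebar element and the name scoped_html are invented here to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: part of the original snippet plus an unrelated "tree" element.
scoped_html = (
    "<div class='root'>"
    "<div class='tree children1'>text children 1 </div>"
    "<div class='tree children2'>text children 2 </div>"
    "</div>"
    "<div class='sidebar'><div class='tree'>unrelated</div></div>"
)
soup = BeautifulSoup(scoped_html, "html.parser")

everywhere = soup.select("div.tree")         # also picks up the sidebar element
scoped = soup.select("div.root > div.tree")  # only direct children of div.root

print(len(everywhere), len(scoped))
```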
You could use a selector to find divs whose class attribute contains children:
In [5]: bs_tab.select("div[class*=children]")
Out[5]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
But again, if other tags had classes with children in the name, they would also be picked up.
You could be a bit more specific with a regex and look for children followed by one or more digits:
In [6]: bs_tab.find_all('div', class_=re.compile(r"children\d+"))
Out[6]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
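The pattern above is searched for anywhere in a class name, so a class such as grandchildren12 would match too. A possible tightening, sketched here under the assumption that the regex is tested against each individual class name, is to anchor it at both ends. The markup and the name strict_html are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with one wanted and one unwanted class name.
strict_html = (
    "<div class='root'>"
    "<div class='tree children1'>text children 1 </div>"
    "<div class='tree grandchildren12'>not wanted</div>"
    "</div>"
)
soup = BeautifulSoup(strict_html, "html.parser")

# Unanchored: matches any class name that contains "children<digits>".
loose = soup.find_all("div", class_=re.compile(r"children\d+"))
# Anchored: the class name must be exactly "children<digits>".
strict = soup.find_all("div", class_=re.compile(r"^children\d+$"))

print(len(loose), len(strict))
```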
Or find all the div.tree elements and check whether the last name in tag["class"] starts with children:
In [7]: [t for t in bs_tab.select("div.tree") if t["class"][-1].startswith("children")]
Out[7]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
Or look for children and check whether the first CSS class name is equal to tree:
In [8]: [t for t in bs_tab.select("div[class*=children]") if t["class"][0] == "tree"]
Out[8]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]