Filtering out HTML elements which have 'display:none' either as a tag attribute or in their CSS
Question
Let's say you have some HTML source that's been scraped with Selenium and parsed with BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get(url)  # url is defined elsewhere
soup = BeautifulSoup(driver.page_source, "html.parser")
Is there a way to remove, from the HTML code or the soup object, all elements which either:

1.) have the attribute style="display:none" within the HTML tag source (i.e. <div style="display:none">...</div>)

or

2.) have the display:none property within the page's CSS?
Answer
I think I remember dealing with a web site like this: the IP address was internally represented via multiple HTML elements, some of them hidden via a display: none style, some carrying a CSS class that made them invisible. Getting the real IP address out of this mess via BeautifulSoup was quite difficult.
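Staying inside BeautifulSoup, that kind of class-based hiding can only be approximated if you already know which CSS classes on the target page hide content; the class names and markup below are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical class names: on a real page you would have to read
# these out of the site's stylesheet by hand.
HIDING_CLASSES = {"hidden", "invisible"}

html = """
<span>10.0.</span>
<span class="hidden">99.</span>
<span>0.1</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop every tag carrying one of the known hiding classes.
for tag in soup.find_all(class_=True):
    if HIDING_CLASSES & set(tag["class"]):
        tag.decompose()

print(soup.get_text(strip=True))  # -> 10.0.0.1
```

This breaks as soon as the site renames its classes or hides elements through a stylesheet rule you did not account for, which is exactly why the Selenium route discussed next is more robust.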
The good news is that Selenium actually handles this use case: whenever you read the .text of a WebElement, it returns the visible text of the element, which is exactly what is needed.
Demo:
In [1]: from selenium import webdriver
In [2]: driver = webdriver.Firefox()
In [3]: driver.get("http://proxylist.hidemyass.com/")
In [4]: for row in driver.find_elements_by_css_selector("section.proxy-results table#listable tr")[1:]:
...: cells = row.find_elements_by_tag_name("td")
...: print(cells[1].text.strip())
...:
101.26.38.162
120.198.236.10
213.85.92.10
...
216.161.239.51
212.200.111.198
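For completeness, case 1 from the question (an inline style attribute) can be stripped purely in BeautifulSoup without Selenium; a minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<div>visible</div>
<div style="display:none">hidden</div>
<div style=" display : none ">also hidden</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every tag whose inline style contains display:none,
# ignoring whitespace; CSS-rule-based hiding (case 2) is untouched.
for tag in soup.find_all(style=True):
    if "display:none" in tag["style"].replace(" ", ""):
        tag.decompose()

print(soup.get_text(strip=True))  # -> visible
```

Note this only covers inline styles; elements hidden by a stylesheet rule still require the Selenium approach above.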