Filtering out HTML elements which have 'display:none' either as a tag attribute or in their CSS
Problem description
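Let's say you have some HTML source that has been scraped with Selenium and parsed with BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get(url)  # url is the address of the page to scrape
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(driver.page_source, "html.parser")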
Is there a way to remove, from the HTML code or the soup object, all elements which either:
1.) have the attribute style=display:none within the HTML tag source (i.e. <div style = 'display:none'>...</div>), or
2.) have the display:none property within the page's CSS?
I think I remember dealing with a website like this: the IP address was internally represented via multiple HTML elements, some of which were hidden via a display: none style, while others had a CSS class that made them invisible. Getting the real IP address out of this mess via BeautifulSoup was quite difficult.
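To illustrate why the BeautifulSoup-only route is awkward, here is a minimal sketch that covers just the inline case. It is not part of the original answer: the helper names is_inline_hidden and strip_inline_hidden are hypothetical, and the code does nothing about elements hidden through a stylesheet class, which is the hard part.

```python
import re

# Matches a whole "display: none" declaration inside an inline style string,
# tolerating whitespace, letter case, and an optional "!important".
_DISPLAY_NONE = re.compile(
    r"(?:^|;)\s*display\s*:\s*none\s*(?:!\s*important\s*)?(?:;|$)",
    re.IGNORECASE,
)


def is_inline_hidden(style):
    """Return True if a style attribute value declares display: none."""
    return bool(style) and _DISPLAY_NONE.search(style) is not None


def strip_inline_hidden(soup):
    """Remove, in place, every tag hidden by its own style attribute.

    `soup` is assumed to be a BeautifulSoup object. Tags are removed one
    at a time so that a hidden tag nested inside another hidden tag is
    never touched after its parent has been destroyed.
    """
    while True:
        tag = soup.find(style=is_inline_hidden)
        if tag is None:
            return soup
        tag.decompose()
```

With the soup object from the question, calling strip_inline_hidden(soup) and then soup.get_text() would omit the inline-hidden fragments, but anything hidden by a CSS class would still be present.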
The good news is that selenium actually handles this use case: whenever you get the .text of a WebElement, it returns the element's visible text, which is exactly what is needed.
Demo:
In [1]: from selenium import webdriver
In [2]: from selenium.webdriver.common.by import By
In [3]: driver = webdriver.Firefox()
In [4]: driver.get("http://proxylist.hidemyass.com/")
In [5]: for row in driver.find_elements(By.CSS_SELECTOR, "section.proxy-results table#listable tr")[1:]:
   ...:     cells = row.find_elements(By.TAG_NAME, "td")
   ...:     print(cells[1].text.strip())
   ...:
101.26.38.162
120.198.236.10
213.85.92.10
...
216.161.239.51
212.200.111.198
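If the goal is still a soup object rather than the text of individual WebElements, one option is to let the browser delete hidden nodes before page_source is read, since the browser already knows the computed style of every element. The following is only a sketch under stated assumptions and not something from the original answer: the function name page_source_without_hidden is hypothetical, the script mutates the live page, and it checks only the computed display value (not visibility, opacity, or off-screen positioning).

```python
# JavaScript run inside the page: remove every element under <body> whose
# computed display value is "none", whether that comes from an inline
# style or from a CSS rule.
_REMOVE_HIDDEN_JS = """
document.querySelectorAll('body *').forEach(function (el) {
    if (window.getComputedStyle(el).display === 'none') {
        el.remove();
    }
});
"""


def page_source_without_hidden(driver):
    """Strip display:none elements in the browser, then return the HTML.

    `driver` is assumed to be a Selenium WebDriver with the page loaded.
    """
    driver.execute_script(_REMOVE_HIDDEN_JS)
    return driver.page_source
```

The returned string could then be handed to BeautifulSoup as in the question. Note that script and style tags inside body are removed as well, because browsers give them a computed display of none.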