Filtering out HTML elements which have 'display:none' either as a tag attribute or in their CSS


Problem description


Let's say you have some html source that's been scraped with Selenium, and parsed with BeautifulSoup:

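from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get(url)  # url: address of the page to scrape
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(driver.page_source, "html.parser")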

Is there a way to remove, from the HTML code or the soup object, all elements that either:

1.) have the attribute style=display:none within the HTML tag source (e.g. <div style = 'display:none'>...</div>)

or

2.) have the display:none property within the page's CSS?

Solution

I think I remember dealing with a website like this: the IP address was represented internally via multiple HTML elements; some of them were hidden via a display: none style, while others had a CSS class that made them invisible. Getting the real IP address out of this mess with BeautifulSoup was quite difficult.
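To illustrate why the BeautifulSoup route only gets you part of the way, here is a minimal sketch of a check for the inline case. The helper name has_inline_display_none and the regular expression are my own illustration, not part of any library, and the check only looks at the style attribute text, so anything hidden through a stylesheet class goes undetected.

```python
import re

# Matches a `display: none` declaration inside an inline style string,
# tolerating extra whitespace, letter case and a trailing `!important`.
# Simplification: it does not resolve overrides such as
# "display:none; display:block", where the later declaration would win.
_DISPLAY_NONE = re.compile(
    r"(?:^|;)\s*display\s*:\s*none\s*(?:!important\s*)?(?:;|$)",
    re.IGNORECASE,
)


def has_inline_display_none(style):
    """Return True if an inline style attribute value sets display to none."""
    if not style:  # attribute missing (None) or empty string
        return False
    return _DISPLAY_NONE.search(style) is not None


print(has_inline_display_none("color: red; display: none"))
print(has_inline_display_none("display: block"))
```

With a parsed soup, the predicate could presumably be passed as an attribute filter, for example soup.find_all(style=has_inline_display_none), calling decompose() on each match. That still leaves the CSS-class case untouched, which is where this approach breaks down.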

The good news is that Selenium actually handles this use case: whenever you read the .text of a WebElement, it returns the element's visible text, which is exactly what is needed.

Demo:

In [1]: from selenium import webdriver

In [2]: from selenium.webdriver.common.by import By

In [3]: driver = webdriver.Firefox()

In [4]: driver.get("http://proxylist.hidemyass.com/")

In [5]: for row in driver.find_elements(By.CSS_SELECTOR, "section.proxy-results table#listable tr")[1:]:  # [1:] skips the first row
   ...:     cells = row.find_elements(By.TAG_NAME, "td")
   ...:     print(cells[1].text.strip())  # .text yields only the visible text
   ...: 
101.26.38.162
120.198.236.10
213.85.92.10
...
216.161.239.51
212.200.111.198
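If you really do need the hidden elements gone from the markup itself (for example, to keep parsing with BeautifulSoup), one option is to let the browser decide what is hidden, since it already knows the computed style of every element. The following is only a sketch under assumptions: the names REMOVE_HIDDEN_JS and html_without_hidden are made up for illustration, the script modifies the live page, and it only considers display: none (not visibility: hidden or similar tricks). It will also drop script and style tags inside the body, because browsers compute display: none for those by default.

```python
# JavaScript executed inside the browser. It removes every element under
# <body> whose *computed* display is "none", so inline styles and stylesheet
# rules are both covered, then returns the serialized document.
REMOVE_HIDDEN_JS = """
const hidden = Array.from(document.querySelectorAll('body *'))
    .filter(el => window.getComputedStyle(el).display === 'none');
hidden.forEach(el => el.remove());
return document.documentElement.outerHTML;
"""


def html_without_hidden(driver):
    """Strip display:none elements from the loaded page and return its HTML.

    `driver` is expected to be a Selenium WebDriver with a page already
    loaded; only its execute_script() method is used here.
    """
    return driver.execute_script(REMOVE_HIDDEN_JS)
```

After driver.get(url), the string returned by html_without_hidden(driver) could be handed to BeautifulSoup in place of driver.page_source.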
