Filtering out HTML elements which have 'display:none' either as a tag attribute or in their CSS

Question

Let's say you have some HTML source that has been scraped with Selenium and parsed with BeautifulSoup:

from selenium import webdriver
from bs4 import BeautifulSoup

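driver = webdriver.Firefox()
driver.get(url)  # url: address of the page to scrape (not given in the question)
soup = BeautifulSoup(driver.page_source, "html.parser")  # naming the parser avoids bs4's "no parser specified" warning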

Is there a way to remove, from the HTML code or the soup object, all elements that meet either of these conditions:

1.) the element has the attribute style="display:none" in its HTML tag (e.g. <div style="display:none">...</div>)

2.) the element gets the display:none property from the page's CSS

Recommended answer

I think I remember dealing with a website like this: the IP address was represented internally by multiple HTML elements, some hidden with an inline display: none style and others given a CSS class that made them invisible. Getting the real IP address out of this mess with BeautifulSoup alone was quite difficult.

The good news is that Selenium handles this use case: whenever you read the .text of a WebElement, it returns only the element's visible text, which is exactly what is needed here.
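If whole elements need to be skipped rather than just their text, WebElement.is_displayed() gives the browser's own verdict, which also covers display:none coming from a stylesheet. Below is a minimal sketch of that filtering step. The helper name visible_texts and the FakeElement stand-in are hypothetical and exist only so the snippet runs without a browser; with Selenium you would pass in the list returned by driver.find_elements(...).

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (demo only)."""
    text: str
    displayed: bool

    def is_displayed(self) -> bool:
        return self.displayed


def visible_texts(elements):
    """Return the stripped text of every element reported as displayed.

    Works with any objects exposing .text and .is_displayed(),
    which Selenium WebElements do.
    """
    return [el.text.strip() for el in elements if el.is_displayed()]


if __name__ == "__main__":
    cells = [
        FakeElement(" 101", True),
        FakeElement("77", False),   # plays the role of a display:none decoy
        FakeElement(".26 ", True),
    ]
    print("".join(visible_texts(cells)))
```

The filter is duck-typed on purpose, so the same function can be exercised in a unit test and then reused on real WebElements.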

Demo:

In [1]: from selenium import webdriver

In [2]: from selenium.webdriver.common.by import By

In [3]: driver = webdriver.Firefox()

In [4]: driver.get("http://proxylist.hidemyass.com/")

In [5]: for row in driver.find_elements(By.CSS_SELECTOR, "section.proxy-results table#listable tr")[1:]:  # [1:] skips the first row (presumably the header)
   ...:     cells = row.find_elements(By.TAG_NAME, "td")
   ...:     print(cells[1].text.strip())  # .text returns only the visible text
   ...: 
101.26.38.162
120.198.236.10
213.85.92.10
...
216.161.239.51
212.200.111.198
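For case 1 from the question (an inline style attribute), a BeautifulSoup-only pass is also possible. The following is a rough sketch, not part of the original answer: the helper names is_display_none and remove_inline_hidden are made up for illustration, and the style parsing is deliberately simple (it splits on ";" and does not handle values that themselves contain semicolons). It cannot see rules that come from stylesheets (case 2), since BeautifulSoup does not evaluate CSS.

```python
import re


def is_display_none(style: str) -> bool:
    """Return True if an inline style string ends up with display: none."""
    display = None
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        if sep and prop.strip().lower() == "display":
            # The last display declaration wins; "!important" is simply dropped.
            display = re.sub(r"!\s*important", "", value, flags=re.I).strip().lower()
    return display == "none"


def remove_inline_hidden(soup):
    """Detach every tag whose own style attribute sets display:none.

    `soup` is expected to be a BeautifulSoup object (or any bs4 Tag).
    """
    # Collect first, then detach, so nested hidden tags cause no trouble.
    hidden = [tag for tag in soup.find_all(style=True) if is_display_none(tag["style"])]
    for tag in hidden:
        tag.extract()  # removes the tag together with its children
    return soup


def demo():
    # Requires beautifulsoup4; the HTML snippet is invented for illustration.
    from bs4 import BeautifulSoup

    html = (
        '<td>10<span style="display:none">99</span>'
        '1.<span style="display: inline">26</span></td>'
    )
    soup = remove_inline_hidden(BeautifulSoup(html, "html.parser"))
    print(soup.get_text())
```

Calling demo() prints the cell text with the hidden span removed. For elements hidden through CSS classes, the browser-side checks in Selenium remain the more reliable route.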
