How to wait for the site to return the data using Beautifulsoup4

Problem description

I wrote a script using beautifulsoup4. It fetches the list of ciphers from a table on a web page.

The problem is that my Python script doesn't wait for the web page to return its content, so it either breaks or fails with 'list index out of range'. The code is as follows:

ssl_lab_url = 'https://www.ssllabs.com/ssltest/analyze.html?d=' + site
req = requests.get(ssl_lab_url)
data = req.text
soup = BeautifulSoup(data)
print(CYELLOW + "Now Bringing in the LIST of cipher gathered from SSL LABS for " + str(ssl_lab_url) + CEND)
for i in tqdm(range(10000)):
    sleep(0.01)
    # fails with IndexError when the page has no reportTable yet
    table = soup.find_all('table', class_='reportTable', limit=5)[-1]
    data = [str(td.text.split()[0]) for td in table.select("td.tableLeft")]
print(CGREEN + str(data) + CEND)
time.sleep(1)

Sometimes data comes back empty, and sometimes the script fails with:

Traceback (most recent call last):
  File "multiple_scan_es.py", line 79, in <module>
    scan_cipher_ssl(list_url )
  File "multiple_scan_es.py", line 62, in scan_cipher_ssl
    table = soup.find_all('table',class_='reportTable', limit=5)[-1]
IndexError: list index out of range

I need to wait here. How can I do that?

Recommended answer

I expected this page to use JavaScript to fetch its data, but it uses the old HTML method of refreshing the page.

It adds the HTML tag <meta http-equiv="refresh" content="time; url">, and the browser reloads the page after time seconds.
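The content attribute packs the delay and an optional target URL into one string. As a minimal sketch, assuming the usual "seconds; url=target" layout, a hypothetical helper named parse_refresh_content (not part of the original answer) could split that string like this:

```python
def parse_refresh_content(content):
    """Split a meta-refresh content value into (delay_seconds, url_or_None).

    Hypothetical helper for illustration; assumes the "seconds; url=target" layout.
    """
    delay_part, _, rest = content.partition(';')
    try:
        delay = float(delay_part.strip())
    except ValueError:
        # Missing or non-numeric delay: treat it as "reload immediately".
        delay = 0.0
    rest = rest.strip()
    url = None
    if rest.lower().startswith('url='):
        # Drop the "url=" prefix and any surrounding quotes.
        url = rest[4:].strip().strip('\'"') or None
    return delay, url


if __name__ == '__main__':
    print(parse_refresh_content('5; url=https://example.com/next'))
```

The returned delay is what a script would pass to time.sleep before requesting the page again.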

You have to check for this tag. If you find it, the result is not ready yet, so you can wait and then load the page again. In most cases you can reload the page without waiting: either you get the data, or you find the tag again.

import requests
from bs4 import BeautifulSoup
import time

site = 'some_site_name.com'
url = 'https://www.ssllabs.com/ssltest/analyze.html?d=' + site

# --- reload until the page no longer contains a meta refresh tag ---

while True:
    r = requests.get(url)

    soup = BeautifulSoup(r.text, 'html.parser')

    refresh = soup.find_all('meta', attrs={'http-equiv': 'refresh'})
    #print('refresh:', refresh)

    if not refresh:
        break

    # optional: wait as long as the tag asks before reloading
    #wait = int(refresh[0].get('content', '0').split(';')[0])
    #print('wait:', wait)
    #time.sleep(wait)

# --- no refresh tag left: read the ciphers from the last report table ---

table = soup.find_all('table', class_='reportTable', limit=5)

if table:
    table = table[-1]
    data = [str(td.text.split()[0]) for td in table.select("td.tableLeft")]
    print(str(data))
else:
    print("[!] no data")
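The loop above never gives up if the page keeps returning the refresh tag. A possible variant, sketched here under assumptions, caps the number of reloads and honors the delay from the tag. The function name fetch_report, its parameters, and the 5-second fallback are all hypothetical and not part of the original answer. The fetch argument is any callable that takes a URL and returns HTML text.

```python
import time

from bs4 import BeautifulSoup


def fetch_report(url, fetch, max_attempts=20, default_wait=5, sleep=time.sleep):
    """Reload `url` until the meta refresh tag is gone or attempts run out.

    Hypothetical helper for illustration. `fetch` takes a URL and returns
    HTML text. Returns the parsed soup of the final page, or None on give-up.
    """
    for _ in range(max_attempts):
        soup = BeautifulSoup(fetch(url), 'html.parser')
        refresh = soup.find('meta', attrs={'http-equiv': 'refresh'})
        if refresh is None:
            # No refresh tag: the page is assumed to hold the final report.
            return soup
        # The content attribute looks like "5; url"; take the seconds part.
        first = refresh.get('content', '').split(';')[0].strip()
        sleep(int(first) if first.isdigit() else default_wait)
    return None
```

With requests, the call could look like fetch_report(url, fetch=lambda u: requests.get(u, timeout=30).text). A None result means the report was still not ready after max_attempts reloads, which the caller can report as "no data".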
