How to extract the Coronavirus cases from a website?

Problem Description

I'm trying to extract the Coronavirus case numbers from a website (https://www.trackcorona.live), but I got an error.

This is my code:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)

It gives the following error (though it was working a few days ago) because it returns an empty list (li):

IndexError                               Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
      2 data=BeautifulSoup(response.text,'html.parser')
      3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
      5 countries = li[1].get_text()
      6 dead = int(li[3].get_text())

IndexError: list index out of range
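An empty result like this can be guarded against before indexing. A minimal sketch (the HTML snippets here are illustrative stand-ins, not the live page):

```python
from bs4 import BeautifulSoup

def extract_numbers(html):
    """Return the text of every element with class 'numbers' (empty list if none)."""
    soup = BeautifulSoup(html, 'html.parser')
    return [el.get_text() for el in soup.find_all(class_='numbers')]

# The CloudFlare challenge page contains no 'numbers' elements,
# so check for an empty list instead of letting IndexError propagate:
values = extract_numbers('<html><body>Checking your browser...</body></html>')
if not values:
    print('empty list -- the real page was not returned')
```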

Solution

Well, actually the site generates a redirection behind CloudFlare, and the content is then loaded dynamically via JavaScript once the page loads. We could therefore use several approaches such as selenium or requests_html, but I will mention the quickest solution, as we will render the JS on the fly :)

import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper solves the CloudFlare JS challenge before returning the page
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text

soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)

Output:

110981
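The scraped value is text; converting it back to the integer the original code expected could look like this (the comma handling is an assumption about how the site might format large numbers):

```python
def parse_count(text):
    # drop thousands separators and surrounding whitespace before converting
    return int(text.replace(',', '').strip())

print(parse_count('110,981'))  # prints 110981
```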

A tip about the 503 response code:

Basically, that code means Service Unavailable.
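That mapping can be checked from Python's standard library:

```python
from http import HTTPStatus

# HTTPStatus pairs each numeric code with its standard reason phrase
status = HTTPStatus.SERVICE_UNAVAILABLE
print(status.value, status.phrase)  # prints: 503 Service Unavailable
```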

More technically, the GET request you sent could not be served. The reason is that the request got stuck at the receiver, https://www.trackcorona.live/, which hands it off to another source on the same host, namely https://www.trackcorona.live/?cf_chl_jschl_tk=

where __cf_chl_jschl_tk__= holds a token to be authenticated.

So your code should usually follow that redirect and serve the host the required data.

Something like the following shows the end URL:

import requests
from bs4 import BeautifulSoup


def main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # the challenge form's action attribute holds the redirect target
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)


main()

Output:

https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc

Now, to be able to call the end URL, you need to pass the required form data:

Something like this:

def main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        # placeholder values -- the real ones are computed by the challenge JS
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)


main()
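In a real attempt, those placeholder values would come from the challenge form's hidden inputs. A minimal sketch of collecting them, using illustrative form markup rather than the live page:

```python
from bs4 import BeautifulSoup

# Illustrative challenge-form markup (an assumption, not the live page)
sample = '''
<form id="challenge-form" action="/?__cf_chl_jschl_tk__=TOKEN" method="POST">
  <input type="hidden" name="r" value="abc"/>
  <input type="hidden" name="jschl_vc" value="def"/>
  <input type="hidden" name="pass" value="ghi"/>
</form>
'''

soup = BeautifulSoup(sample, 'html.parser')
form = soup.find('form', id='challenge-form')
# collect every named hidden input into the POST payload
data = {inp['name']: inp.get('value', '') for inp in form.find_all('input')}
print(data)
```

Note that jschl_answer is not among the hidden inputs; it is computed by the challenge JavaScript itself, which is exactly why a plain requests session cannot fill it in.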

Here you will end up with text that lacks your desired values, because those values are rendered via JS.
