How to extract the Coronavirus cases from a website?
I'm trying to extract the Coronavirus case numbers from a website (https://www.trackcorona.live), but I get an error.
This is my code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days ago) because it returns an empty list (li):
IndexError                                Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
2 data=BeautifulSoup(response.text,'html.parser')
3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
5 countries = li[1].get_text()
6 dead = int(li[3].get_text())
IndexError: list index out of range
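Why `li` comes back empty can be reproduced offline with a stub resembling what the site actually returns (an interstitial page rather than the data page; the token in the `action` attribute is elided here):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the site answers with an interstitial
# challenge page, which contains no class="numbers" elements at all.
challenge_html = '<form id="challenge-form" action="/?__cf_chl_jschl_tk__=..."></form>'
li = BeautifulSoup(challenge_html, 'html.parser').find_all(class_='numbers')
print(len(li))  # 0, so li[0] raises IndexError
```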
Well, actually the site generates a redirection behind CloudFlare, and the content is then loaded dynamically via JavaScript once the page loads. Therefore we can use several approaches, such as selenium and requests_html, but I will mention the quickest solution, as we will render the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper solves the Cloudflare JS challenge for us
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text
soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
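If the anchor id ever changes or the challenge page comes back, `soup.find` returns `None` and `.text` raises `AttributeError`. A small defensive variant (offline, against a stub of the rendered markup; the id `valueTot` is the one used above) avoids that:

```python
from bs4 import BeautifulSoup

# Stub of the rendered page; in the real flow this would be scraper.get(...).text
rendered = '<a id="valueTot">110981</a>'
soup = BeautifulSoup(rendered, 'html.parser')

node = soup.find("a", id="valueTot")
# Guard against the element being absent (e.g. an unrendered challenge page).
confirmed = int(node.text) if node else None
print(confirmed)  # 110981
```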
A tip for the 503 response code:
Basically that code refers to Service Unavailable. More technically, the GET request you sent couldn't be served. The reason is that the request got stuck at the receiver of the request, https://www.trackcorona.live/, which hands it off to another source on the same HOST, namely https://www.trackcorona.live/?cf_chl_jschl_tk= where __cf_chl_jschl_tk__= holds a token to be authenticated. So you usually have to follow the redirect in your code and serve the host the required data.
Something like the following shows the end URL:
import requests
from bs4 import BeautifulSoup

def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)

Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
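The redirect composition in `Main()` can be checked offline against a stub of the challenge form (the `TOKEN` value here is a placeholder, not a real token):

```python
from bs4 import BeautifulSoup

url = "https://www.trackcorona.live"
# Hypothetical challenge page; only the form's action attribute matters here.
stub = '<form id="challenge-form" action="/?__cf_chl_jschl_tk__=TOKEN"></form>'
soup = BeautifulSoup(stub, 'html.parser')
redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
print(redirect)  # https://www.trackcorona.live/?__cf_chl_jschl_tk__=TOKEN
```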
Now, to be able to call the end URL, you need to pass the required Form-Data. Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)

Main()
Here you will end up with text without your desired values, because your values are rendered via JS.