How to extract the Coronavirus cases from a website?
Question

I'm trying to extract the Coronavirus case counts from a website (https://www.trackcorona.live), but I get an error.

Here is my code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days back) because find_all is returning an empty list (li):
IndexError
Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
2 data=BeautifulSoup(response.text,'html.parser')
3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
5 countries = li[1].get_text()
6 dead = int(li[3].get_text())
IndexError: list index out of range
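As a side note, the IndexError itself can be guarded against by checking whether the scrape returned anything before indexing. A minimal sketch, run against a local HTML snippet rather than the live site (the sample markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the page; the real site no longer
# serves these elements directly (they are rendered via JavaScript).
html = '<div class="numbers">110981</div><div class="numbers">503</div>'

soup = BeautifulSoup(html, 'html.parser')
li = soup.find_all(class_='numbers')

if li:  # guard against an empty result instead of blindly indexing
    confirmed = int(li[0].get_text())
    print('Confirmed Cases:', confirmed)
else:
    print('No elements with class "numbers" found; page layout may have changed')
```

This doesn't fix the scrape, but it turns a crash into a readable message when the page layout changes.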
Answer
Well, actually the site is generating a redirection behind CloudFlare, and it's then loaded dynamically via JavaScript once the page loads. Therefore we could use several approaches such as selenium and requests_html, but I will mention the quickest solution for you, as we will render the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper solves the CloudFlare JS challenge for us
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text

soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
A hint about the 503 response code: basically, that code refers to service unavailable.
More technically, the GET request you sent couldn't be served. The reason is that the request got stuck between the receiver of the request, https://www.trackcorona.live/, which hands it off to another resource on the same HOST, namely https://www.trackcorona.live/?cf_chl_jschl_tk=, where __cf_chl_jschl_tk__= holds a token to be authenticated.
So you should usually follow up in your code by serving the host the required data. Something like the following shows the end url:
import requests
from bs4 import BeautifulSoup

def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)

Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
Now, to be able to call the end URL, you need to pass the required Form-Data. Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)

Main()
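The 'none' placeholders above stand in for values that actually live in hidden inputs of the challenge form. A hedged sketch of pulling them out of the page, shown against a made-up HTML snippet since the real token values change on every request (the field names match what CloudFlare challenge pages typically use, but the values here are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical challenge-form markup; the values are invented placeholders.
html = '''
<form id="challenge-form" action="/?__cf_chl_jschl_tk__=abc123" method="POST">
  <input type="hidden" name="r" value="r-token"/>
  <input type="hidden" name="jschl_vc" value="vc-token"/>
  <input type="hidden" name="pass" value="pass-token"/>
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
form = soup.find('form', id='challenge-form')

# Collect every hidden input's name/value pair into the POST payload
data = {inp.get('name'): inp.get('value')
        for inp in form.find_all('input', type='hidden')}
print(data)  # {'r': 'r-token', 'jschl_vc': 'vc-token', 'pass': 'pass-token'}
```

Note that jschl_answer still has to be computed by actually executing the page's JavaScript challenge, which is exactly why cloudscraper (shown earlier) is the easier route.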
Here you will end up with text but without your desired values, because your values are rendered via JS.