PYTHON: How do I use BeautifulSoup to parse a table into a pandas dataframe
Question
I am trying to scrape the CDC website for the data of the last 7 days of reported COVID-19 cases. https://covid.cdc.gov/covid-data-tracker/#cases_casesinlast7days I've tried to find the table by name, id, and class, and it always returns as NoneType. When I print the scraped data, I can't manually locate the table in the HTML either. I'm not sure what I'm doing wrong here. Once the data is imported, I need to populate a pandas DataFrame to later use for graphing purposes, and export the data table as a CSV.
You might as well request the data from the API directly (check out the Network tab in your browser while refreshing the page):
import requests
import pandas as pd

# The XHR endpoint the page itself calls for the US map data.
endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"

# The id parameter selects the dataset; the response is plain JSON.
data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()

# The US_MAP_DATA key holds a list of per-state records.
df = pd.DataFrame(data["US_MAP_DATA"])
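The question also asks to export the table as a CSV for later graphing. A minimal sketch of that last step, using a stand-in frame since the real column names depend on the JSON the endpoint currently returns (inspect df.columns first; the names and tot_cases columns here are assumptions):

```python
import pandas as pd

# Stand-in for the frame built from the API payload; the real columns
# must be checked with df.columns before relying on them.
df = pd.DataFrame([
    {"name": "Texas", "tot_cases": 100},
    {"name": "Ohio", "tot_cases": 50},
])

# Persist the table for later graphing; index=False drops the row index.
df.to_csv("us_map_data.csv", index=False)

# Reload to confirm the round trip preserved all rows and columns.
restored = pd.read_csv("us_map_data.csv")
print(restored.shape)  # (2, 2)
```

From here, restored (or the live df) can be passed straight to df.plot() or any other graphing call.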
EDIT: Trying to make this answer more general and useful.
How did you discern that this was how to parse the data?
Firstly, you need to inspect the page (Ctrl + Shift + I) and navigate to the Network tab:
Secondly, you need to refresh the page to record network activity.
Where to look?
Check XHR to limit the number of records (1);
Look through the records by clicking on them (2) and check their preview responses (3) to find out if it's the data you need.
It doesn't always work, but when it does, parsing data from the API directly is so much easier than writing scrapers with requests / bs4 / selenium etc., and should be the first choice.
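Once a promising XHR record is found, the preview pane (3) shows the shape of the JSON body. A small sketch of drilling into such a payload locally, where the keys mirror the US_MAP_DATA answer above and are assumptions, not a real CDC response:

```python
import json

# Stand-in for the JSON body an XHR record might return; the real keys
# must be read off the preview pane in the Network tab.
payload = json.loads('{"US_MAP_DATA": [{"name": "Texas", "tot_cases": 100}]}')

# List the top-level keys to find which one holds the tabular records.
print(list(payload.keys()))  # ['US_MAP_DATA']

# Each record is a dict of column -> value, ready for pd.DataFrame(rows).
rows = payload["US_MAP_DATA"]
print(rows[0]["name"])  # Texas
```

The same two probes (top-level keys, then one sample record) usually tell you whether a record is tabular data worth replaying outside the browser.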