How to web scrape a chart by using Python?


Problem Description


I am trying to web scrape, using Python 3, a chart from this website into a .csv file: 2016 NBA National TV Schedule

The chart starts out like:

Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN

I am using these packages:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

The output I want in a .csv file looks like this (the screenshot from the original post is not reproduced here; see the sketch below):

These are the first six lines of the chart on the website, written into the .csv file. Notice how multiple dates are used more than once. How do I implement the scraper to get this output?
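For reference, here is a minimal sketch of the intended layout, built from the four games in the chart snippet above, with one row per game and the date repeated. The column names and output file name are illustrative guesses, not taken from the original screenshot:

import csv

# Expected rows, one per game, with the date repeated (hypothetical column names).
rows = [
    ('Tuesday, October 25',   '8:00 PM',  'Knicks',  'Cavaliers', 'TNT'),
    ('Tuesday, October 25',   '10:30 PM', 'Spurs',   'Warriors',  'TNT'),
    ('Wednesday, October 26', '8:00 PM',  'Thunder', 'Sixers',    'ESPN'),
    ('Wednesday, October 26', '10:30 PM', 'Rockets', 'Lakers',    'ESPN'),
]

with open('expected.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Time', 'Team 1', 'Team 2', 'Network'])
    writer.writerows(rows)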

Solution

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

# The schedule sits in a single <p> with <br> separators; flatten it into a list of lines.
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

# Group the lines: a line containing a weekday name starts a new date,
# every following line is a game belonging to that date.
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        # '8:00 PM Knicks/Cavaliers TNT' -> ('8:00 PM', 'Knicks', 'Cavaliers', 'TNT')
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

# Flatten the {date: [(time, team1, team2, network), ...]} dict into columns.
all_data = {'Date':[], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)

df.to_csv('data.csv')

Prints:

                      Date      Time    Team 1     Team 2 Network
0      Tuesday, October 25   8:00 PM    Knicks  Cavaliers     TNT
1      Tuesday, October 25  10:30 PM     Spurs   Warriors     TNT
2    Wednesday, October 26   8:00 PM   Thunder     Sixers    ESPN
3    Wednesday, October 26  10:30 PM   Rockets     Lakers    ESPN
4     Thursday, October 27   8:00 PM   Celtics      Bulls     TNT
..                     ...       ...       ...        ...     ...
159      Saturday, April 8   8:30 PM  Clippers      Spurs     ABC
160       Monday, April 10   8:00 PM   Wizards    Pistons     TNT
161       Monday, April 10  10:30 PM   Rockets   Clippers     TNT
162    Wednesday, April 12   8:00 PM     Hawks     Pacers    ESPN
163    Wednesday, April 12  10:30 PM  Pelicans    Blazers    ESPN

[164 rows x 5 columns]

And saves data.csv (the LibreOffice screenshot is not reproduced here).
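A note on how the parsing works: itertools.groupby is keyed on whether a line contains a weekday name, so the flat list produced by get_text(separator='|').split('|') collapses into date headers followed by their game lines, and the regex then splits each game line into time, teams, and network. Here is a minimal standalone sketch of that idea, using the sample lines from the question (no network access needed):

import re
from itertools import groupby

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

# Flat list of lines, as produced by get_text(separator='|').split('|') in the solution above.
lines = [
    'Tuesday, October 25',
    '8:00 PM Knicks/Cavaliers TNT',
    '10:30 PM Spurs/Warriors TNT',
    'Wednesday, October 26',
    '8:00 PM Thunder/Sixers ESPN',
]

dates, last = {}, ''
for is_date, group in groupby(lines, lambda line: any(d in line for d in days)):
    if is_date:
        last = next(group)      # e.g. 'Tuesday, October 25' starts a new date
        dates[last] = []
    else:
        # '8:00 PM Knicks/Cavaliers TNT' -> ('8:00 PM', 'Knicks', 'Cavaliers', 'TNT')
        dates[last].extend(re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', row)[0] for row in group)

print(dates)
# {'Tuesday, October 25': [('8:00 PM', 'Knicks', 'Cavaliers', 'TNT'),
#                          ('10:30 PM', 'Spurs', 'Warriors', 'TNT')],
#  'Wednesday, October 26': [('8:00 PM', 'Thunder', 'Sixers', 'ESPN')]}

As a side note, df.to_csv('data.csv') as written also stores the DataFrame index as an extra unnamed first column; passing index=False would drop it if it is not wanted.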
