Problems Parsing NBA Boxscore Data with BeautifulSoup


Problem description

I am trying to parse player-level NBA boxscore data from ESPN. The following is the initial portion of my attempt:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')

It seems that BeautifulSoup is giving me a strange result. The last 'table' in the source code contains the player data and that is what I want to extract. Looking at the source code online shows that this table is closed at line 421, which is AFTER both teams' box scores. However, if we look at 'soup', there is an added line that closes the table BEFORE the Miami stats. This occurs at line 350 in the online source code.
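
One way to check whether the raw markup itself closes the table early, independently of any BeautifulSoup parser, is to log every table open and close tag together with its source line. The following is a minimal sketch using only the standard library; the class name `TableDepthTracker`, the helper `table_events`, and the `sample` string are hypothetical stand-ins, not the actual ESPN page.

```python
from html.parser import HTMLParser


class TableDepthTracker(HTMLParser):
    """Record the source line of every <table> open and close tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []  # (line number, 'open' or 'close', depth after the tag)

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
            self.events.append((self.getpos()[0], 'open', self.depth))

    def handle_endtag(self, tag):
        if tag == 'table':
            self.depth -= 1
            self.events.append((self.getpos()[0], 'close', self.depth))


def table_events(html):
    """Return the list of table open/close events found in the markup."""
    tracker = TableDepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.events


# Hypothetical stand-in for the page source: a stray </table> on line 3.
sample = "<table>\n<tr><td>BOS</td></tr>\n</table>\n<tr><td>MIA</td></tr>\n</table>\n"
for line, kind, depth in table_events(sample):
    print(line, kind, depth)
```

A depth that drops below zero marks a closing tag with no matching opener. If the real page source shows that pattern, the early close is present in the markup itself and different parsers are free to recover from it in different ways.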

The output from the parser 'html.parser' is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4 T

BOS 25 29 22 31107MIA 31 31 31 27120

Boston Celtics
STARTERS    
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A  
FTM-A
OREB

As you can see, it ends mid-table at 'OREB' and it never makes it to the Miami Heat section. The output using 'lxml' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4T

BOS 25 29 22 31107MIA 31 31 31 27120

This doesn't include the box scores at all. The complete code I'm using (due to Daniel Rodriguez) looks something like:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name, game_id):
    # One row per player: the name first, then one column per stat header
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            # DNP rows have no stat cells, so their stats stay NaN
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([game_id, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        frame = pd.concat([frame, new])
    return frame

for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    # The game id is passed explicitly instead of being read from a global
    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1, index)
    players = pd.concat([players, team_1_players])

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2, index)
    players = pd.concat([players, team_2_players])

players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')

A sample of the output I'd like is:

,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26

Solution

BeautifulSoup truncated part of the results for me as well, so I replaced the soup.find_all calls with re.findall on the raw HTML:

import re
import requests
from bs4 import BeautifulSoup

# The original snippet called br.open() on a browser object that was never
# defined; requests (already used in the question) is assumed here instead.
r = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
html = r.text
soup = BeautifulSoup(html, 'html.parser')

statnames = re.search('STARTERS</th>.*?PTS</th>', html, re.DOTALL).group()
th = re.findall('th.*</th', statnames)  # each th tag contains a stat name
names = ['Name', 'Team']
for t in th:
    t = re.sub('.*>', '', t)
    t = t.replace('</th', '')
    names.append(t)
print(names)

celts = re.search('Boston Celtics.*?Total Team Turnovers', html, re.DOTALL).group()
heat = re.search('nba-small-mia floatleft.*?Total Team Turnovers', html, re.DOTALL).group()

players = str(soup).split('td nowrap')
for player in players[1:]:
    try:
        stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]
    except AttributeError:  # no full-name match, so try initials such as "D.J."
        stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]  # player name
    # Assign the team by checking which team's section contains the name
    if stats[0] in celts:
        stats.append('Boston Celtics')
    elif stats[0] in heat:
        stats.append('Miami Heat')
    td = re.findall('td.*?/td', player)  # each td tag contains a stat
    for t in td:
        t = re.findall('>.*<', t)
        t = re.sub('.*>', '', t[0])
        t = t.replace('<', '')
        if t != '' and t != '\xa0':  # skip empty cells and non-breaking spaces
            stats.append(t)
    print(stats)

Output:

['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']
['Kevin Garnett', 'Boston Celtics', '32', '4-8', '0-0', '1-1', '1', '11', '12', '2', '0', '2', '5', '4', '-4', '9']
['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5', '11', '1', '0', '0', '1', '2', '-8', '15']
['Paul Pierce', 'Boston Celtics', '41', '6-15', '2-4', '9-9', '0', '5', '5', '5', '2', '0', '0', '3', '-17', '23']
['Rajon Rondo', 'Boston Celtics', '44', '9-14', '0-2', '2-4', '0', '7', '7', '13', '0', '0', '4', '4', '-13', '20']
['Courtney Lee', 'Boston Celtics', '24', '5-6', '1-1', '0-0', '0', '1', '1', '1', '0', '0', '1', '5', '-7', '11']
['Jared Sullinger', 'Boston Celtics', '8', '1-2', '0-0', '0-0', '0', '1', '1', '0', '0', '0', '0', '1', '-3', '2']
['Jeff Green', 'Boston Celtics', '23', '0-4', '0-0', '3-4', '0', '3', '3', '0', '1', '0', '1', '0', '-7', '3']
['Jason Terry', 'Boston Celtics', '25', '2-7', '0-3', '4-4', '0', '0', '0', '1', '1', '0', '3', '3', '-10', '8']
['Leandro Barbosa', 'Boston Celtics', '16', '6-8', '3-3', '1-2', '0', '1', '1', '1', '0', '0', '0', '1', '+4', '16']
['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"]
['Kris Joseph', 'Boston Celtics', "DNP COACH'S DECISION"]
['Jason Collins', 'Boston Celtics', "DNP COACH'S DECISION"]
['Darko Milicic', 'Boston Celtics', "DNP COACH'S DECISION"]
['Shane Battier', 'Miami Heat', '29', '2-4', '2-3', '0-0', '0', '2', '2', '1', '1', '0', '0', '3', '+12', '6']
['LeBron James', 'Miami Heat', '29', '10-16', '2-4', '4-5', '1', '9', '10', '3', '2', '0', '0', '2', '+12', '26']
['Chris Bosh', 'Miami Heat', '37', '8-15', '0-1', '3-4', '2', '8', '10', '1', '0', '3', '1', '3', '+15', '19']
['Mario Chalmers', 'Miami Heat', '36', '3-7', '0-1', '2-2', '0', '1', '1', '11', '3', '0', '1', '3', '+11', '8']
['Dwyane Wade', 'Miami Heat', '35', '10-22', '0-0', '9-11', '2', '1', '3', '4', '2', '1', '4', '3', '-6', '29']
['Udonis Haslem', 'Miami Heat', '11', '0-1', '0-0', '0-0', '0', '3', '3', '0', '0', '0', '1', '1', '-2', '0']
['Rashard Lewis', 'Miami Heat', '19', '4-5', '1-2', '1-2', '0', '5', '5', '1', '0', '1', '0', '1', '+1', '10']
['Norris Cole', 'Miami Heat', '6', '1-2', '1-2', '0-0', '0', '0', '0', '1', '0', '0', '1', '2', '+5', '3']
['Ray Allen', 'Miami Heat', '31', '5-7', '2-3', '7-8', '0', '2', '2', '2', '0', '0', '0', '1', '+9', '19']
['Mike Miller', 'Miami Heat', '7', '0-0', '0-0', '0-0', '0', '0', '0', '1', '0', '0', '0', '1', '+8', '0']
['Josh Harrellson', 'Miami Heat', "DNP COACH'S DECISION"]
['James Jones', 'Miami Heat', "DNP COACH'S DECISION"]
['Terrel Harris', 'Miami Heat', "DNP COACH'S DECISION"]
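
Lists in this shape can be turned into CSV rows close to the format requested in the question. The following is a rough sketch under stated assumptions: the helper `rows_to_csv` is hypothetical, the two sample rows are copied from the output above, and rows for players who did not play are padded with empty fields, which is a choice made here rather than something specified in the question. Unlike the requested sample, it writes no leading index column.

```python
import csv
import io

# Column names as printed by the code above.
names = ['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB',
         'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']


def rows_to_csv(game_id, rows, header=names):
    """Write stat lists as CSV text, padding short (DNP) rows with blanks."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['id', 'team', 'player'] + header[2:])
    for row in rows:
        name, team, stats = row[0], row[1], row[2:]
        if len(stats) < len(header) - 2:
            # Did-not-play rows carry one note instead of stats; leave stats empty.
            stats = [''] * (len(header) - 2)
        writer.writerow([game_id, team, name] + stats)
    return buffer.getvalue()


rows = [
    ['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5',
     '11', '1', '0', '0', '1', '2', '-8', '15'],
    ['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"],
]
print(rows_to_csv(400277722, rows))
```

Writing through `io.StringIO` keeps the sketch free of file system side effects; passing an open file object to `csv.writer` instead would write to disk.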

To catch names written with initials, such as D.J. Augustin, the simplest (though not the most concise) code is:

try:
    stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]
except AttributeError:  # re.search returned None: no full-name match
    stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]
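
As an alternative sketch, the two patterns can be joined with alternation so that a single search covers both name forms without exception handling. The names `NAME_PATTERN` and `find_name` and the HTML fragments below are hypothetical. The behavior also differs slightly from the try/except version: here the leftmost match of either form wins, whereas the original prefers a full-name match anywhere in the fragment.

```python
import re

# Full names (including "LeBron"-style capitalization) or initials plus a surname.
NAME_PATTERN = re.compile(
    r'[A-Z]?[a-z]?[A-Z][a-z]+ [A-Z][a-z]+'   # e.g. "LeBron James"
    r'|[A-Z]\.?[A-Z]?\.? [A-Z][a-z]+'         # e.g. "D.J. Augustin"
)


def find_name(fragment):
    """Return the first player-like name in an HTML fragment, or None."""
    match = NAME_PATTERN.search(fragment)
    return match.group() if match else None


# Hypothetical fragments in the spirit of the boxscore rows.
print(find_name('<a href="#">LeBron James</a>, SF'))
print(find_name('<a href="#">D.J. Augustin</a>, PG'))
print(find_name('<td>TOTALS</td>'))
```

Returning `None` for fragments without a name lets the caller skip header and totals rows explicitly instead of relying on an exception.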
