Python Beautifulsoup4 website parsing

Problem description

I'm trying to scrape some sports data from a website using Beautifulsoup4, but am having some trouble figuring out how to proceed. I'm not that great with HTML, and can't seem to figure out the last bit of syntax necessary. Once the data is parsed, I'm going to plug it into a Pandas dataframe. I'm trying to extract the home team, away team, and score. Here's my code so far:

from bs4 import BeautifulSoup
import urllib2
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)

Here is a sample of the output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>

I need to store the home team (Newcastle), the score (0-3) and the away team (Sunderland) in three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag, and can't seem to figure out the syntax in bs4. I need like a tag.value property, but all I have found in the documentation is a tag.name or tag.attrs. Any help or pointers would be greatly appreciated!

Recommended answer

Each score unit is located inside a <td class='match-details'> element; loop over those elements to extract the match details.

From there, you can extract the text from children elements using the .stripped_strings generator; just pass it to ''.join() to get all strings contained in a tag. Pick team-home, score and team-away separately for ease of parsing:

for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)

With an additional print this gives:

>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
... 
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
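For anyone who wants to try the approach without hitting the live page, here is a minimal, self-contained sketch in Python 3. It runs the same `find` / `.stripped_strings` pattern against inline markup modelled on the sample output in the question. The surrounding table structure, the second match's placeholder link, the helper names (`text_of`, `extract_matches`, `to_csv`) and the CSV column names are all assumptions made for illustration, not taken from the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the sample output in the question. The
# <table>/<tr> wrapper and the "#" link are assumptions for illustration.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="match-details">
      <span class="team-home teams">
        <a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
      <span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
      <span class="team-away teams">
        <a href="/sport/football/teams/sunderland">Sunderland</a> </span>
    </td>
  </tr>
  <tr>
    <td class="match-details">
      <span class="team-home teams"> <a href="#">West Ham</a> </span>
      <span class="score"> <abbr title="Score"> 2-0 </abbr> </span>
      <span class="team-away teams"> <a href="#">Swansea</a> </span>
    </td>
  </tr>
  <tr><td class="match-details">No result yet</td></tr>
</table>
"""


def text_of(parent, css_class):
    """Return the stripped text of the first <span> with css_class, or None."""
    tag = parent.find("span", class_=css_class)
    return "".join(tag.stripped_strings) if tag else None


def extract_matches(html):
    """Collect one dict per match cell that has home, score and away."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for match in soup.find_all("td", class_="match-details"):
        home = text_of(match, "team-home")
        score = text_of(match, "score")
        away = text_of(match, "team-away")
        # Skip cells that are missing any of the three pieces.
        if home and score and away:
            rows.append({"home": home, "score": score, "away": away})
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["home", "score", "away"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


matches = extract_matches(SAMPLE_HTML)
print(to_csv(matches))
```

Keeping each match as a dict gives three separate fields per row, and a list of such dicts can also be handed to `pandas.DataFrame(matches)` if a dataframe is the end goal. `tag.get_text(strip=True)` should give the same text as `''.join(tag.stripped_strings)`, if you prefer a single call.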
