Wikipedia Data Scraping with Python


Question

I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the columns that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far, and it outputs nothing; I'm not entirely sure why. I believe it is due to the a tags, but I do not know what to change. Any help would be greatly appreciated.

import urllib2                     # Python 2 (urllib.request in Python 3)
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })

#print table

#output = open('output.csv','w')

for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file

    #output.write(write_to_file)

#output.close()

I know a lot of it is commented out because I was trying to find where the breakdown was.

Answer

This is what I would do:

  • find the Player Selections paragraph
  • get the next wikitable using find_next_sibling()
  • find all tr tags inside
  • for every row, find td and th tags and get the desired cells by index

Here is the code:

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request('http://en.wikipedia.org/wiki/2008_NFL_draft',
                      headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req), 'html.parser')

filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])

    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college

And here is the output (only quarterbacks are filtered):

Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
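A note for readers on Python 3: the answer's urllib2 import and print statement are Python 2 only. Below is a sketch of the same approach, factored into a function so the parsing can be exercised without hitting the network. The function name quarterbacks is my own, bs4 must be installed (pip install beautifulsoup4), and the cell indices 3-6 assume the column layout of the 2008 draft table described in the answer:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def quarterbacks(html, position='QB'):
    """Yield (nfl_team, player, college) tuples for rows matching `position`.

    Assumes the 2008-draft column order: index 3 = NFL team,
    4 = player, 5 = position, 6 = college.
    """
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('span', id='Player_selections').parent
    table = heading.find_next_sibling('table', class_='wikitable')
    for row in table.find_all('tr')[1:]:        # skip the header row
        cells = row.find_all(['td', 'th'])
        try:
            team = cells[3].get_text(strip=True)
            name = cells[4].get_text(strip=True)
            pos = cells[5].get_text(strip=True)
            college = cells[6].get_text(strip=True)
        except IndexError:
            continue                            # rows with too few cells
        if pos == position:
            yield team, name, college


# Fetching the live page in Python 3 (urllib.request replaces urllib2):
# import urllib.request
# req = urllib.request.Request(
#     'http://en.wikipedia.org/wiki/2008_NFL_draft',
#     headers={'User-Agent': 'Mozilla/5.0'})
# for team, name, college in quarterbacks(urllib.request.urlopen(req).read()):
#     print(team, name, college)
```

Using get_text(strip=True) instead of find(text=True) (as in the question) concatenates all text inside a cell, so linked player names are not silently dropped.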
