Wikipedia Data Scraping with Python
Question
I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the columns that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far, and it outputs nothing; I'm not entirely sure why. I believe it is due to the a tags, but I do not know what to change. Any help would be greatly appreciated.
import urllib2
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })
#print table
#output = open('output.csv','w')

for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file
        #output.write(write_to_file)

#output.close()
I know a lot of it is commented out because I was trying to find where the breakdown was.
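(A side note on the suspected a-tag problem: `findAll` searches by tag *name*, and no tag is named `href` — it is an attribute of `<a>` tags, so the search always comes back empty. A minimal demonstration, assuming the bs4 package is installed:)

```python
from bs4 import BeautifulSoup

# One table row like those on the Wikipedia page.
row = BeautifulSoup("<tr><td><a href='/wiki/X'>Player X</a></td></tr>",
                    "html.parser")

print(row.find_all("href"))          # [] -- no tag is *named* "href"
print(row.find_all("td")[0].text)    # Player X  -- search by tag name instead
print(row.find_all("a")[0]["href"])  # /wiki/X   -- attributes are indexed, not searched
```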
Answer
Here is what I would do:
- find the Player selections paragraph
- get the next wikitable using find_next_sibling()
- find all tr tags inside
- for every row, find td and th tags and get the desired cells by index
Here is the code:
filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])
    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue
    if position != filter_position:
        continue
    print nfl_team, name, position, college
And here is the output (filtered to quarterbacks only):
Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
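For anyone running this today, the same approach can be sketched in Python 3 (the answer above is Python 2). The snippet below is a minimal, offline version: the inline HTML is a small stand-in mimicking the Wikipedia markup, not the real page — to scrape the live article, fetch it with urllib.request and pass the response to BeautifulSoup instead. It assumes the bs4 package is installed and uses the same cell indices as the answer.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page structure (heading + sortable table),
# so the example runs without network access. Rows/columns mirror the layout
# the answer's indices (3..6) assume.
html = """
<h2><span id="Player_selections">Player selections</span></h2>
<table class="wikitable sortable">
  <tr><th>Rnd.</th><th>Pick</th><th>Trade</th><th>NFL team</th>
      <th>Player</th><th>Pos.</th><th>College</th><th>Conf.</th></tr>
  <tr><td>1</td><td>1</td><td></td><td>Miami Dolphins</td>
      <td>Jake Long</td><td>OT</td><td>Michigan</td><td>Big Ten</td></tr>
  <tr><td>1</td><td>3</td><td></td><td>Atlanta Falcons</td>
      <td>Matt Ryan</td><td>QB</td><td>Boston College</td><td>ACC</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
filter_position = "QB"

# Locate the "Player selections" heading, then the table that follows it.
heading = soup.find("span", id="Player_selections").parent
table = heading.find_next_sibling("table", class_="wikitable")

quarterbacks = []
for row in table.find_all("tr")[1:]:      # skip the header row
    cells = row.find_all(["td", "th"])
    try:
        # Same cell indices as the answer: team, name, position, college.
        nfl_team, name, position, college = (
            cells[3].text, cells[4].text, cells[5].text, cells[6].text)
    except IndexError:
        continue                          # rows without enough cells
    if position == filter_position:
        quarterbacks.append((nfl_team, name, position, college))

print(quarterbacks)
# [('Atlanta Falcons', 'Matt Ryan', 'QB', 'Boston College')]
```

From here, writing the rows out with the csv module (as the question's commented-out output.csv code intended) is a one-liner per row.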