Reformatting scraped selenium table
Problem description
I'm scraping a table that displays info for a sporting league. So far so good for a selenium beginner:
from selenium import webdriver
import re
import pandas as pd
driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []
for i in infotable:
    ilist.append(i.text)
infolist = ilist[0]
for i in matches:
    match.append(i.text)
driver.close()
home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])
df = pd.DataFrame({'home' : home, 'away' : away})
date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)
In the last line, date scrapes all the dates in the table, but I can't link them to the corresponding game.
My thinking was: for child/element "under the date", date = last_found_date.
The ultimate goal is to have two more columns in df: one with the date of the match, and the next with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).
Should I be incorporating another program/method to retain the order of the tags/elements of the table?
Recommended answer
You would need to change the way you extract the match information. Instead of extracting the home and away teams separately, do it in one loop that also extracts the dates and events:
from selenium import webdriver
import pandas as pd
driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")
data = []
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    # each match row holds "home - away" in its participant cell
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    # the closest preceding header cell carries the date (and, optionally, the event)
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
    if " - " in date:
        date, event = date.split(" - ")
    else:
        event = "Not specified"
    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })
driver.close()
df = pd.DataFrame(data)
print(df)
Prints:
    away                   date         event          home
 0  Washington Capitals    25 Apr 2015  Play Offs      New York Islanders
 1  Minnesota Wild         25 Apr 2015  Play Offs      St.Louis Blues
 2  Ottawa Senators        25 Apr 2015  Play Offs      Montreal Canadiens
 3  Pittsburgh Penguins    25 Apr 2015  Play Offs      New York Rangers
 4  Calgary Flames         24 Apr 2015  Play Offs      Vancouver Canucks
 5  Chicago Blackhawks     24 Apr 2015  Play Offs      Nashville Predators
 6  Tampa Bay Lightning    24 Apr 2015  Play Offs      Detroit Red Wings
 7  New York Islanders     24 Apr 2015  Play Offs      Washington Capitals
 8  St.Louis Blues         23 Apr 2015  Play Offs      Minnesota Wild
 9  Anaheim Ducks          23 Apr 2015  Play Offs      Winnipeg Jets
10  Montreal Canadiens     23 Apr 2015  Play Offs      Ottawa Senators
11  New York Rangers       23 Apr 2015  Play Offs      Pittsburgh Penguins
12  Vancouver Canucks      22 Apr 2015  Play Offs      Calgary Flames
13  Nashville Predators    22 Apr 2015  Play Offs      Chicago Blackhawks
14  Washington Capitals    22 Apr 2015  Play Offs      New York Islanders
15  Tampa Bay Lightning    22 Apr 2015  Play Offs      Detroit Red Wings
16  Anaheim Ducks          21 Apr 2015  Play Offs      Winnipeg Jets
17  St.Louis Blues         21 Apr 2015  Play Offs      Minnesota Wild
18  New York Rangers       21 Apr 2015  Play Offs      Pittsburgh Penguins
19  Vancouver Canucks      20 Apr 2015  Play Offs      Calgary Flames
20  Montreal Canadiens     20 Apr 2015  Play Offs      Ottawa Senators
21  Nashville Predators    19 Apr 2015  Play Offs      Chicago Blackhawks
22  Washington Capitals    19 Apr 2015  Play Offs      New York Islanders
23  Winnipeg Jets          19 Apr 2015  Play Offs      Anaheim Ducks
24  Pittsburgh Penguins    19 Apr 2015  Play Offs      New York Rangers
25  Minnesota Wild         18 Apr 2015  Play Offs      St.Louis Blues
26  Detroit Red Wings      18 Apr 2015  Play Offs      Tampa Bay Lightning
27  Calgary Flames         18 Apr 2015  Play Offs      Vancouver Canucks
28  Chicago Blackhawks     18 Apr 2015  Play Offs      Nashville Predators
29  Ottawa Senators        18 Apr 2015  Play Offs      Montreal Canadiens
30  New York Islanders     18 Apr 2015  Play Offs      Washington Capitals
31  Winnipeg Jets          17 Apr 2015  Play Offs      Anaheim Ducks
32  Minnesota Wild         17 Apr 2015  Play Offs      St.Louis Blues
33  Detroit Red Wings      17 Apr 2015  Play Offs      Tampa Bay Lightning
34  Pittsburgh Penguins    17 Apr 2015  Play Offs      New York Rangers
35  Calgary Flames         16 Apr 2015  Play Offs      Vancouver Canucks
36  Chicago Blackhawks     16 Apr 2015  Play Offs      Nashville Predators
37  Ottawa Senators        16 Apr 2015  Play Offs      Montreal Canadiens
38  New York Islanders     16 Apr 2015  Play Offs      Washington Capitals
39  Edmonton Oilers        12 Apr 2015  Not specified  Vancouver Canucks
40  Anaheim Ducks          12 Apr 2015  Not specified  Arizona Coyotes
41  Chicago Blackhawks     12 Apr 2015  Not specified  Colorado Avalanche
42  Nashville Predators    12 Apr 2015  Not specified  Dallas Stars
43  Boston Bruins          12 Apr 2015  Not specified  Tampa Bay Lightning
44  Pittsburgh Penguins    12 Apr 2015  Not specified  Buffalo Sabres
45  Detroit Red Wings      12 Apr 2015  Not specified  Carolina Hurricanes
46  New Jersey Devils      12 Apr 2015  Not specified  Florida Panthers
47  Columbus Blue Jackets  12 Apr 2015  Not specified  New York Islanders
48  Montreal Canadiens     12 Apr 2015  Not specified  Toronto Maple Leafs
49  Calgary Flames         11 Apr 2015  Not specified  Winnipeg Jets
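The "last found date" idea from the question can also be sketched without a browser, which may help when testing the parsing logic on its own. The snippet below is only an illustration under stated assumptions: `rows` is a hypothetical flat list of (kind, text) pairs in page order, the helper names `split_header` and `attach_dates` are made up for this example, and the header text is assumed to look like "16 Apr 2015 - Play Offs" or just "12 Apr 2015", as the split on " - " in the answer suggests.

```python
from datetime import datetime


def split_header(header_text):
    """Split a header such as '16 Apr 2015 - Play Offs' into (date, event)."""
    date_part, sep, event_part = header_text.partition(" - ")
    # no separator means the header carries only a date
    event = event_part.strip() if sep else "Not specified"
    return date_part.strip(), event


def attach_dates(rows):
    """Carry the most recent header forward onto every following match row.

    `rows` is a hypothetical list of (kind, text) tuples in page order,
    where kind is either 'header' or 'match'.
    """
    records = []
    current_date, current_event = None, "Not specified"
    for kind, text in rows:
        if kind == "header":
            # remember the latest date/event until the next header appears
            current_date, current_event = split_header(text)
        elif kind == "match":
            home, away = (team.strip() for team in text.split(" - ", 1))
            records.append({
                "home": home,
                "away": away,
                "date": current_date,
                "event": current_event,
            })
    return records


# sample rows (assumed layout, not scraped live)
rows = [
    ("header", "16 Apr 2015 - Play Offs"),
    ("match", "Washington Capitals - New York Islanders"),
    ("header", "12 Apr 2015"),
    ("match", "Vancouver Canucks - Edmonton Oilers"),
    ("match", "Arizona Coyotes - Anaheim Ducks"),
]
records = attach_dates(rows)

# optional: turn the date strings into real dates for sorting or filtering
parsed = [datetime.strptime(r["date"], "%d %b %Y").date() for r in records]
```

The list of dicts in `records` has the same shape as `data` in the answer, so it could be passed to pd.DataFrame in the same way. Note that parsing month abbreviations with "%b" depends on the current locale.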