Reformatting scraped Selenium table

Question

I'm scraping a table that displays info for a sporting league. So far so good for a Selenium beginner:

from selenium import webdriver
import re
import pandas as pd

driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')

driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []

for i in infotable:
    ilist.append(i.text)
# Only the text of the first matching table is used below
infolist = ilist[0]

for i in matches:
    match.append(i.text)

driver.close()

home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])

df = pd.DataFrame({'home' : home, 'away' : away})

date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)

In the last line, date collects all the dates in the table, but I can't link them to the corresponding games.

My idea is: for each child/element "under the date", date = last_found_date.

The ultimate goal is to have two more columns in df: one with the date of the match, and the next with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).

Should I be incorporating another program/method to retain the order of the tags/elements in the table?

Recommended answer

You would need to change the way you extract the match information. Instead of extracting the home and away teams separately, do it in one loop that also extracts the dates and events:

from selenium import webdriver

import pandas as pd

driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

data = []
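# Walk the match rows (tr.deactivate) in document order
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    # The nearest preceding header cell holds the date and, optionally, an event label
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text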

    if " - " in date:
        date, event = date.split(" - ")
    else:
        event = "Not specified"

    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })

driver.close()

df = pd.DataFrame(data)
print(df)

Prints:

                     away         date          event                 home
0     Washington Capitals  25 Apr 2015      Play Offs   New York Islanders
1          Minnesota Wild  25 Apr 2015      Play Offs       St.Louis Blues
2         Ottawa Senators  25 Apr 2015      Play Offs   Montreal Canadiens
3     Pittsburgh Penguins  25 Apr 2015      Play Offs     New York Rangers
4          Calgary Flames  24 Apr 2015      Play Offs    Vancouver Canucks
5      Chicago Blackhawks  24 Apr 2015      Play Offs  Nashville Predators
6     Tampa Bay Lightning  24 Apr 2015      Play Offs    Detroit Red Wings
7      New York Islanders  24 Apr 2015      Play Offs  Washington Capitals
8          St.Louis Blues  23 Apr 2015      Play Offs       Minnesota Wild
9           Anaheim Ducks  23 Apr 2015      Play Offs        Winnipeg Jets
10     Montreal Canadiens  23 Apr 2015      Play Offs      Ottawa Senators
11       New York Rangers  23 Apr 2015      Play Offs  Pittsburgh Penguins
12      Vancouver Canucks  22 Apr 2015      Play Offs       Calgary Flames
13    Nashville Predators  22 Apr 2015      Play Offs   Chicago Blackhawks
14    Washington Capitals  22 Apr 2015      Play Offs   New York Islanders
15    Tampa Bay Lightning  22 Apr 2015      Play Offs    Detroit Red Wings
16          Anaheim Ducks  21 Apr 2015      Play Offs        Winnipeg Jets
17         St.Louis Blues  21 Apr 2015      Play Offs       Minnesota Wild
18       New York Rangers  21 Apr 2015      Play Offs  Pittsburgh Penguins
19      Vancouver Canucks  20 Apr 2015      Play Offs       Calgary Flames
20     Montreal Canadiens  20 Apr 2015      Play Offs      Ottawa Senators
21    Nashville Predators  19 Apr 2015      Play Offs   Chicago Blackhawks
22    Washington Capitals  19 Apr 2015      Play Offs   New York Islanders
23          Winnipeg Jets  19 Apr 2015      Play Offs        Anaheim Ducks
24    Pittsburgh Penguins  19 Apr 2015      Play Offs     New York Rangers
25         Minnesota Wild  18 Apr 2015      Play Offs       St.Louis Blues
26      Detroit Red Wings  18 Apr 2015      Play Offs  Tampa Bay Lightning
27         Calgary Flames  18 Apr 2015      Play Offs    Vancouver Canucks
28     Chicago Blackhawks  18 Apr 2015      Play Offs  Nashville Predators
29        Ottawa Senators  18 Apr 2015      Play Offs   Montreal Canadiens
30     New York Islanders  18 Apr 2015      Play Offs  Washington Capitals
31          Winnipeg Jets  17 Apr 2015      Play Offs        Anaheim Ducks
32         Minnesota Wild  17 Apr 2015      Play Offs       St.Louis Blues
33      Detroit Red Wings  17 Apr 2015      Play Offs  Tampa Bay Lightning
34    Pittsburgh Penguins  17 Apr 2015      Play Offs     New York Rangers
35         Calgary Flames  16 Apr 2015      Play Offs    Vancouver Canucks
36     Chicago Blackhawks  16 Apr 2015      Play Offs  Nashville Predators
37        Ottawa Senators  16 Apr 2015      Play Offs   Montreal Canadiens
38     New York Islanders  16 Apr 2015      Play Offs  Washington Capitals
39        Edmonton Oilers  12 Apr 2015  Not specified    Vancouver Canucks
40          Anaheim Ducks  12 Apr 2015  Not specified      Arizona Coyotes
41     Chicago Blackhawks  12 Apr 2015  Not specified   Colorado Avalanche
42    Nashville Predators  12 Apr 2015  Not specified         Dallas Stars
43          Boston Bruins  12 Apr 2015  Not specified  Tampa Bay Lightning
44    Pittsburgh Penguins  12 Apr 2015  Not specified       Buffalo Sabres
45      Detroit Red Wings  12 Apr 2015  Not specified  Carolina Hurricanes
46      New Jersey Devils  12 Apr 2015  Not specified     Florida Panthers
47  Columbus Blue Jackets  12 Apr 2015  Not specified   New York Islanders
48     Montreal Canadiens  12 Apr 2015  Not specified  Toronto Maple Leafs
49         Calgary Flames  11 Apr 2015  Not specified        Winnipeg Jets
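
The "date = last_found_date" idea from the question can also be sketched without XPath, by walking the rows in document order and remembering the most recent date header. This is only a minimal illustration: the rows list is a hypothetical stand-in for the row texts (on the real page a match row may also contain the time, score and odds), and attach_dates and DATE_RE are names made up for this example.

```python
import re

# Hypothetical stand-in for the table rows in document order: each entry is
# the visible text of one row, either a date header or a match row.
rows = [
    "25 Apr 2015 - Play Offs",
    "New York Islanders - Washington Capitals",
    "St.Louis Blues - Minnesota Wild",
    "12 Apr 2015",
    "Vancouver Canucks - Edmonton Oilers",
]

# A date header looks like "25 Apr 2015", optionally followed by " - <event>".
DATE_RE = re.compile(r"^(\d{2} \w{3} \d{4})(?: - (.+))?$")


def attach_dates(rows):
    """Pair every match row with the most recent date header above it."""
    records = []
    last_date, last_event = None, "Not specified"
    for text in rows:
        header = DATE_RE.match(text.strip())
        if header:
            # Date header: remember it for the match rows that follow.
            last_date = header.group(1)
            last_event = header.group(2) or "Not specified"
            continue
        # Match row: split once so the team names stay intact.
        home, away = text.split(" - ", 1)
        records.append({
            "home": home.strip(),
            "away": away.strip(),
            "date": last_date,
            "event": last_event,
        })
    return records


records = attach_dates(rows)
for record in records:
    print(record)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as data in the answer above.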
