Selenium scraping with multiple URLs

Problem description

Following my previous question, I'm now trying to scrape multiple pages of a URL (all the pages with games in a given season). I'm also trying to scrape multiple parent URLs (seasons):

from selenium import webdriver
import pandas as pd
import time

url = ['http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/', 
       'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/']

data = []

for i in url:
    for j in range(1,8):
        print i+str(j)        
        driver = webdriver.PhantomJS()        
        driver.implicitly_wait(10)        
        driver.get(i+str(j))


        for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
            home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
            date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

            if " - " in date:
                date, event = date.split(" - ")
            else:
                event = "Not specified"

            data.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": date.strip(),
                "event": event.strip()
            })

        driver.close()
        time.sleep(3)
        print str(j)+" was ok"

df = pd.DataFrame(data)
print df

# ok for six results then socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
# ok for two results, then infinite load
# added time.sleep(3)
# ok for first result, infinite load after that
# added implicitly wait
# no result, infinite load

At first I ran the code twice without either the implicit wait on line 14 or the sleep on line 35. The first run gave the socket error. The second run stalled with no error after two pages had been scraped successfully.

I then added the waits as noted above, and they haven't helped.

Since the results are not consistent, my guess is that the connection should be reset between the end of the loop and the next run. I'd like to know whether that's a likely solution and how to implement it. I checked the site's robots.txt and can't see anything that prevents scraping after a set interval.

Secondly, say the scraper gets 90% of the pages, then stalls (infinite wait). Is there a way to have it retry that loop after x seconds, so as to save what you've got and retry from the stalled point?

Recommended answer

What you need to do is:

  • reuse the same webdriver instance; do not initialize it inside the loop
  • introduce Explicit Waits, which should make the code both more reliable and faster

Implementation:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import pandas as pd


urls = [
    'http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
    'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/'
]

data = []

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)
wait = WebDriverWait(driver, 10)

for url in urls:
    for page in range(1, 8):
        driver.get(url + str(page))
        # wait for the page to load
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#tournamentTable tr.deactivate")))

        for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
            home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
            date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

            if " - " in date:
                date, event = date.split(" - ")
            else:
                event = "Not specified"

            data.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": date.strip(),
                "event": event.strip()
            })

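# quit() ends the whole session and stops the browser process,
# whereas close() only closes the current window
driver.quit()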

df = pd.DataFrame(data)
print(df)

Prints:

                   away         date          event                home
0              Salzburg  14 Apr 2015      Play Offs     Vienna Capitals
1       Vienna Capitals  12 Apr 2015      Play Offs            Salzburg
2              Salzburg  10 Apr 2015      Play Offs     Vienna Capitals
3       Vienna Capitals  07 Apr 2015      Play Offs            Salzburg
4       Vienna Capitals  31 Mar 2015      Play Offs         Liwest Linz
5              Salzburg  29 Mar 2015      Play Offs          Klagenfurt
6           Liwest Linz  29 Mar 2015      Play Offs     Vienna Capitals
7            Klagenfurt  26 Mar 2015      Play Offs            Salzburg
8       Vienna Capitals  26 Mar 2015      Play Offs         Liwest Linz
9           Liwest Linz  24 Mar 2015      Play Offs     Vienna Capitals
10             Salzburg  24 Mar 2015      Play Offs          Klagenfurt
11           Klagenfurt  22 Mar 2015      Play Offs            Salzburg
12      Vienna Capitals  22 Mar 2015      Play Offs         Liwest Linz
13              Bolzano  20 Mar 2015      Play Offs         Liwest Linz
14        Fehervar AV19  18 Mar 2015      Play Offs     Vienna Capitals
15          Liwest Linz  17 Mar 2015      Play Offs             Bolzano
16      Vienna Capitals  16 Mar 2015      Play Offs       Fehervar AV19
17              Villach  15 Mar 2015      Play Offs            Salzburg
18           Klagenfurt  15 Mar 2015      Play Offs              Znojmo
19              Bolzano  15 Mar 2015      Play Offs         Liwest Linz
20          Liwest Linz  13 Mar 2015      Play Offs             Bolzano
21        Fehervar AV19  13 Mar 2015      Play Offs     Vienna Capitals
22               Znojmo  13 Mar 2015      Play Offs          Klagenfurt
23             Salzburg  13 Mar 2015      Play Offs             Villach
24           Klagenfurt  10 Mar 2015      Play Offs              Znojmo
25      Vienna Capitals  10 Mar 2015      Play Offs       Fehervar AV19
26              Bolzano  10 Mar 2015      Play Offs         Liwest Linz
27              Villach  10 Mar 2015      Play Offs            Salzburg
28          Liwest Linz  08 Mar 2015      Play Offs             Bolzano
29               Znojmo  08 Mar 2015      Play Offs          Klagenfurt
..                  ...          ...            ...                 ...
670       TWK Innsbruck  28 Sep 2013  Not specified              Znojmo
671         Liwest Linz  27 Sep 2013  Not specified            Dornbirn
672             Bolzano  27 Sep 2013  Not specified          Graz 99ers
673          Klagenfurt  27 Sep 2013  Not specified  Olimpija Ljubljana
674       Fehervar AV19  27 Sep 2013  Not specified            Salzburg
675       TWK Innsbruck  27 Sep 2013  Not specified     Vienna Capitals
676             Villach  27 Sep 2013  Not specified              Znojmo
677            Salzburg  24 Sep 2013  Not specified  Olimpija Ljubljana
678            Dornbirn  22 Sep 2013  Not specified       TWK Innsbruck
679          Graz 99ers  22 Sep 2013  Not specified          Klagenfurt
680     Vienna Capitals  22 Sep 2013  Not specified             Villach
681       Fehervar AV19  21 Sep 2013  Not specified             Bolzano
682            Dornbirn  20 Sep 2013  Not specified             Bolzano
683             Villach  20 Sep 2013  Not specified          Graz 99ers
684              Znojmo  20 Sep 2013  Not specified          Klagenfurt
685  Olimpija Ljubljana  20 Sep 2013  Not specified         Liwest Linz
686       Fehervar AV19  20 Sep 2013  Not specified       TWK Innsbruck
687            Salzburg  20 Sep 2013  Not specified     Vienna Capitals
688             Villach  15 Sep 2013  Not specified          Klagenfurt
689         Liwest Linz  15 Sep 2013  Not specified            Dornbirn
690     Vienna Capitals  15 Sep 2013  Not specified       Fehervar AV19
691       TWK Innsbruck  15 Sep 2013  Not specified            Salzburg
692          Graz 99ers  15 Sep 2013  Not specified              Znojmo
693  Olimpija Ljubljana  14 Sep 2013  Not specified            Dornbirn
694             Bolzano  14 Sep 2013  Not specified       Fehervar AV19
695          Klagenfurt  13 Sep 2013  Not specified          Graz 99ers
696              Znojmo  13 Sep 2013  Not specified            Salzburg
697  Olimpija Ljubljana  13 Sep 2013  Not specified       TWK Innsbruck
698             Bolzano  13 Sep 2013  Not specified     Vienna Capitals
699         Liwest Linz  13 Sep 2013  Not specified             Villach

[700 rows x 4 columns]
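
The second part of the question (retrying a stalled page while keeping what has already been scraped) is not covered by the code above. One possible approach is sketched below. It is a minimal sketch under stated assumptions, not part of the original answer: `scrape_pages` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to be a function you write that loads one URL, returns a list of row dicts, and raises an exception when the page stalls.

```python
import time


def scrape_pages(fetch_page, page_urls, max_attempts=3, delay=5.0,
                 sleep=time.sleep):
    """Fetch every URL in page_urls, retrying a failed page before giving up.

    fetch_page is a hypothetical callable (an assumption of this sketch):
    it takes a URL and returns a list of row dicts, and raises an
    exception when the page fails to load in time.
    """
    rows = []    # everything scraped so far survives later failures
    failed = []  # URLs that still failed after max_attempts tries

    for page_url in page_urls:
        for attempt in range(1, max_attempts + 1):
            try:
                rows.extend(fetch_page(page_url))
                break  # success: move on to the next URL
            except Exception as exc:
                print("attempt %d failed for %s: %r"
                      % (attempt, page_url, exc))
                if attempt < max_attempts:
                    sleep(delay)  # pause before retrying the same page
        else:
            # the inner loop never hit `break`, so every attempt failed
            failed.append(page_url)

    return rows, failed
```

With the implementation above, `fetch_page` could wrap the `driver.get(...)` call, the `wait.until(...)` call and the row-parsing loop for a single page. Because `wait.until` raises `TimeoutException` once its timeout expires, a stalled page becomes an error that can be retried rather than an infinite wait. The returned `failed` list records which pages to try again later, and `rows` can be passed to `pd.DataFrame` as before.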
