Selenium scraping with multiple URLs
Question
Following my previous question, I'm now trying to scrape multiple pages of a URL (all the pages with games in a given season). I'm also trying to scrape multiple parent URLs (seasons):
from selenium import webdriver
import pandas as pd
import time

url = ['http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
       'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/']

data = []

for i in url:
    for j in range(1, 8):
        print i + str(j)
        driver = webdriver.PhantomJS()
        driver.implicitly_wait(10)
        driver.get(i + str(j))
        for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
            home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
            date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
            if " - " in date:
                date, event = date.split(" - ")
            else:
                event = "Not specified"
            data.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": date.strip(),
                "event": event.strip()
            })
        driver.close()
        time.sleep(3)
        print str(j) + " was ok"

df = pd.DataFrame(data)
print df

# ok for six results then socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
# ok for two results, then infinite load
# added time.sleep(3)
# ok for first result, infinite load after that
# added implicitly wait
# no result, infinite load
At first I tried the code twice without either the implicit wait or the sleep. The first run gave the socket error; the second run stalled with no error after two good scraped pages.

Then I added the time waits as noted above, and they haven't helped.
Since the results are not consistent, my guess is that the connection is being reset between the end of one loop iteration and the next run. I'd like to know if that's a likely cause and how to work around it. I checked the site's robots.txt and can't see anything that prevents scraping at a set interval.

Secondly, say the scraper gets 90% of the pages and then stalls (infinite wait). Is there a way to have it retry that loop after x seconds, so as to save what you've already got and retry from the stalled point?
Answer
What you need to do is:

- reuse the same webdriver instance - do not initialize it in the loop
- introduce Explicit Waits - this would definitely make the code more reliable and fast

Implementation:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

urls = [
    'http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
    'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/'
]

data = []

# create the driver once, outside the loop, and reuse it for every page
driver = webdriver.PhantomJS()
driver.implicitly_wait(10)
wait = WebDriverWait(driver, 10)

for url in urls:
    for page in range(1, 8):
        driver.get(url + str(page))

        # wait for the results table to load
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#tournamentTable tr.deactivate")))

        for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
            home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
            date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text
            if " - " in date:
                date, event = date.split(" - ")
            else:
                event = "Not specified"
            data.append({
                "home": home.strip(),
                "away": away.strip(),
                "date": date.strip(),
                "event": event.strip()
            })

driver.quit()  # quit() (not close()) shuts down the PhantomJS process

df = pd.DataFrame(data)
print(df)
It prints:
away date event home
0 Salzburg 14 Apr 2015 Play Offs Vienna Capitals
1 Vienna Capitals 12 Apr 2015 Play Offs Salzburg
2 Salzburg 10 Apr 2015 Play Offs Vienna Capitals
3 Vienna Capitals 07 Apr 2015 Play Offs Salzburg
4 Vienna Capitals 31 Mar 2015 Play Offs Liwest Linz
5 Salzburg 29 Mar 2015 Play Offs Klagenfurt
6 Liwest Linz 29 Mar 2015 Play Offs Vienna Capitals
7 Klagenfurt 26 Mar 2015 Play Offs Salzburg
8 Vienna Capitals 26 Mar 2015 Play Offs Liwest Linz
9 Liwest Linz 24 Mar 2015 Play Offs Vienna Capitals
10 Salzburg 24 Mar 2015 Play Offs Klagenfurt
11 Klagenfurt 22 Mar 2015 Play Offs Salzburg
12 Vienna Capitals 22 Mar 2015 Play Offs Liwest Linz
13 Bolzano 20 Mar 2015 Play Offs Liwest Linz
14 Fehervar AV19 18 Mar 2015 Play Offs Vienna Capitals
15 Liwest Linz 17 Mar 2015 Play Offs Bolzano
16 Vienna Capitals 16 Mar 2015 Play Offs Fehervar AV19
17 Villach 15 Mar 2015 Play Offs Salzburg
18 Klagenfurt 15 Mar 2015 Play Offs Znojmo
19 Bolzano 15 Mar 2015 Play Offs Liwest Linz
20 Liwest Linz 13 Mar 2015 Play Offs Bolzano
21 Fehervar AV19 13 Mar 2015 Play Offs Vienna Capitals
22 Znojmo 13 Mar 2015 Play Offs Klagenfurt
23 Salzburg 13 Mar 2015 Play Offs Villach
24 Klagenfurt 10 Mar 2015 Play Offs Znojmo
25 Vienna Capitals 10 Mar 2015 Play Offs Fehervar AV19
26 Bolzano 10 Mar 2015 Play Offs Liwest Linz
27 Villach 10 Mar 2015 Play Offs Salzburg
28 Liwest Linz 08 Mar 2015 Play Offs Bolzano
29 Znojmo 08 Mar 2015 Play Offs Klagenfurt
.. ... ... ... ...
670 TWK Innsbruck 28 Sep 2013 Not specified Znojmo
671 Liwest Linz 27 Sep 2013 Not specified Dornbirn
672 Bolzano 27 Sep 2013 Not specified Graz 99ers
673 Klagenfurt 27 Sep 2013 Not specified Olimpija Ljubljana
674 Fehervar AV19 27 Sep 2013 Not specified Salzburg
675 TWK Innsbruck 27 Sep 2013 Not specified Vienna Capitals
676 Villach 27 Sep 2013 Not specified Znojmo
677 Salzburg 24 Sep 2013 Not specified Olimpija Ljubljana
678 Dornbirn 22 Sep 2013 Not specified TWK Innsbruck
679 Graz 99ers 22 Sep 2013 Not specified Klagenfurt
680 Vienna Capitals 22 Sep 2013 Not specified Villach
681 Fehervar AV19 21 Sep 2013 Not specified Bolzano
682 Dornbirn 20 Sep 2013 Not specified Bolzano
683 Villach 20 Sep 2013 Not specified Graz 99ers
684 Znojmo 20 Sep 2013 Not specified Klagenfurt
685 Olimpija Ljubljana 20 Sep 2013 Not specified Liwest Linz
686 Fehervar AV19 20 Sep 2013 Not specified TWK Innsbruck
687 Salzburg 20 Sep 2013 Not specified Vienna Capitals
688 Villach 15 Sep 2013 Not specified Klagenfurt
689 Liwest Linz 15 Sep 2013 Not specified Dornbirn
690 Vienna Capitals 15 Sep 2013 Not specified Fehervar AV19
691 TWK Innsbruck 15 Sep 2013 Not specified Salzburg
692 Graz 99ers 15 Sep 2013 Not specified Znojmo
693 Olimpija Ljubljana 14 Sep 2013 Not specified Dornbirn
694 Bolzano 14 Sep 2013 Not specified Fehervar AV19
695 Klagenfurt 13 Sep 2013 Not specified Graz 99ers
696 Znojmo 13 Sep 2013 Not specified Salzburg
697 Olimpija Ljubljana 13 Sep 2013 Not specified TWK Innsbruck
698 Bolzano 13 Sep 2013 Not specified Vienna Capitals
699 Liwest Linz 13 Sep 2013 Not specified Villach
[700 rows x 4 columns]
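As for the second question (retrying after a stall), the accepted answer doesn't address it directly. One common pattern is to wrap the page load in a small retry helper so that a stalled or failed request is retried a few times instead of killing the whole run; rows already collected in `data` are preserved either way. This is a minimal sketch under that assumption - the `with_retries` helper and its defaults are hypothetical, not part of the original answer:

```python
import time


def with_retries(action, attempts=3, delay=1):
    """Call ``action`` until it succeeds, sleeping ``delay`` seconds
    between tries. Re-raises the last error if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)


# In the scraping loop, wrap the page load so a stalled page is retried
# instead of aborting the run (hypothetical usage):
#     with_retries(lambda: driver.get(url + str(page)), attempts=3, delay=5)
```

Because `data` is appended to as each page is scraped, even an eventual failure leaves everything gathered so far available for `pd.DataFrame(data)`.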