How to use several instances of Selenium [python]
Problem description
I'm using Selenium for web scraping, but it's too slow, so I'm trying to use two instances to speed it up.
What I want to accomplish is:
1) Create instance_1
2) Create instance_2
3) Open a page in the first instance
do nothing
4) Open a page in the second instance
save the content of the first instance
5) Open a new page in the first instance
save the content of the second instance
The idea is to use the time the first page takes to load to open a second one.
import re
import time

from selenium import webdriver

# data and output_path are assumed to be defined earlier in the script
links = ('https:my_page' + '&LIC=' + code.split('_')[1] for code in data)
browser = webdriver.Firefox()
browser_2 = webdriver.Firefox()
first_link = next(links)
browser.get(first_link)
time.sleep(0.5)
for i, link in enumerate(links):
    if i % 2:  # i starts at 0
        browser_2.get(link)
        time.sleep(0.5)
        try:
            # save what the first browser has loaded
            content = browser.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content)
        except Exception:
            print('error ' + str(i))
    else:
        browser.get(link)
        time.sleep(0.5)
        try:
            # save what the second browser has loaded
            content_2 = browser_2.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content_2)
        except Exception:
            print('error ' + str(i))
But the script waits for the first page to load completely before opening the next one, and this approach is also limited to two pages at a time.
Edit.
I made the following changes to GIRISH RAMNANI's code:
driver_1 = webdriver.Firefox()
driver_2 = webdriver.Firefox()
driver_3 = webdriver.Firefox()
drivers_instance = [driver_1,driver_2,driver_3]
Use the driver and the URL as inputs to the function:
from selenium.webdriver.common.by import By

def get_content(url, driver):
    driver.get(url)
    tag = driver.find_element(By.TAG_NAME, "a")
    # do your work here and return the result
    return tag.get_attribute("href")
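Use the zip function to create link/browser pairs:
from itertools import cycle

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    # pair every link with a driver, reusing drivers when there are more links than drivers
    zip_list = zip(links, cycle(drivers_instance)) if len(links) > len(drivers_instance) else zip(cycle(links), drivers_instance)
    for par in zip_list:
        futures.append(ex.submit(get_content, par[0], par[1]))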
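One caveat with pairing links and drivers through cycle: when there are more links than drivers, two queued tasks can hold the same driver, and the thread pool may run them at the same moment. Below is a minimal sketch of one way around that, assuming the drivers are handed out through a queue.Queue so that each browser serves one task at a time. The names scrape_all and fetch are hypothetical; fetch stands for a function with the same (url, driver) signature as get_content.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def scrape_all(links, drivers, fetch):
    """Run fetch(url, driver) for every link without sharing a driver between running tasks.

    links   -- iterable of URLs
    drivers -- non-empty list of already-created browser instances
    fetch   -- hypothetical callable with the signature fetch(url, driver)
    """
    pool = queue.Queue()
    for driver in drivers:
        pool.put(driver)

    def task(url):
        driver = pool.get()  # blocks until some driver is free
        try:
            return fetch(url, driver)
        finally:
            pool.put(driver)  # hand the driver back, even if fetch() raised

    # One worker per driver, so a worker never waits long for a free browser.
    with ThreadPoolExecutor(max_workers=len(drivers)) as ex:
        # map() yields results in the same order as links
        return list(ex.map(task, links))
```

With the names from the snippets above, the call would be something like scrape_all(links, drivers_instance, get_content).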
Recommended answer
concurrent.futures can be used here.
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://pypi.python.org/pypi/{}"
li = ["pywp/1.3", "augploy/0.3.5"]


def get_content(url):
    # each call starts its own Firefox instance
    driver = webdriver.Firefox()
    driver.get(url)
    tag = driver.find_element(By.TAG_NAME, "a")
    # do your work here and return the result
    return tag.get_attribute("href")


li = [URL.format(link) for link in li]

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    for link in li:
        futures.append(ex.submit(get_content, link))

for future in futures:
    print(future.result())
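As written, each call to get_content starts a browser that is never closed. Below is a minimal sketch of the same pattern with cleanup and per-page error handling. It assumes the browser is created by a factory passed in as make_driver and the per-page work is passed in as extract; both parameter names, and the function names fetch_with_cleanup and scrape, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_cleanup(url, make_driver, extract):
    """Open url in a fresh browser, run extract(driver), and always close the browser."""
    driver = make_driver()
    try:
        driver.get(url)
        return extract(driver)
    finally:
        driver.quit()  # release the browser process even if something above raised


def scrape(urls, make_driver, extract, max_workers=2):
    """Return (results, errors): {url: value} for successes, {url: exception} for failures."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        future_to_url = {
            ex.submit(fetch_with_cleanup, url, make_driver, extract): url
            for url in urls
        }
        # as_completed yields each future as soon as it finishes
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the rest
                errors[url] = exc
    return results, errors
```

In the setting of this answer, make_driver would be webdriver.Firefox and extract would be a small function that looks up the "a" tag and returns its href.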
Keep in mind that two instances of Firefox will start.
Note: you might want to use a headless browser such as PhantomJS instead of Firefox.
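PhantomJS is no longer developed, and Selenium 4 no longer includes a driver for it. Below is a minimal sketch of a headless setup that swaps in Firefox's own headless mode, assuming Selenium 4 and a local Firefox installation. make_headless_firefox is a hypothetical helper name.

```python
def make_headless_firefox():
    """Return a Firefox driver that runs without opening a window."""
    # Imported inside the function so this module still loads where Selenium is missing.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # Firefox's command-line flag for headless mode
    return webdriver.Firefox(options=options)
```

Calling make_headless_firefox() wherever the code above calls webdriver.Firefox() leaves the rest of the scraping logic unchanged.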