How to use several instances of selenium [python]

Question

I'm using selenium for web scraping, but it's too slow, so I'm trying to use two instances to speed it up.

What I want to accomplish is:

1) Create instance_1
2) Create instance_2
3) Open a page in the first instance
do nothing
4) Open a page in the second instance
save the content of the first instance
5) Open a new page in the first instance
save the content of the second instance

The idea is to use the time the first page takes to load to open a second one.

import re
import time

from selenium import webdriver

# data and output_path are defined elsewhere
links = ('https:my_page' + '&LIC=' + code.split('_')[1] for code in data)

browser = webdriver.Firefox()
browser_2 = webdriver.Firefox()

first_link = next(links)
browser.get(first_link)
time.sleep(0.5)

for i, link in enumerate(links):

    if i % 2:       # i starts at 0
        browser_2.get(link)
        time.sleep(0.5)
        try:
            # save what the first browser has loaded
            content = browser.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content)
        except Exception:
            print('error ' + str(i))

    else:
        browser.get(link)
        time.sleep(0.5)
        try:
            # save what the second browser has loaded
            content_2 = browser_2.page_source
            name = re.findall('&LIC=(.+)&SAW', link)[0]
            with open(output_path + name, 'w') as output:
                output.write(content_2)
        except Exception:
            print('error ' + str(i))

But the script waits for the first page to load completely before opening the next one. Also, this approach is limited to two pages at a time.

Edit.

I made the following changes to GIRISH RAMNANI's code:

driver_1 = webdriver.Firefox()
driver_2 = webdriver.Firefox()
driver_3 = webdriver.Firefox()

drivers_instance = [driver_1,driver_2,driver_3]

Use the driver and the URL as inputs to the function:

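from selenium.webdriver.common.by import By

def get_content(url, driver):
    driver.get(url)
    # find_element_by_tag_name was removed in Selenium 4.3
    tag = driver.find_element(By.TAG_NAME, "a")
    # do your work here and return the result
    return tag.get_attribute("href")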

Use the zip function to create link/browser pairs:

from itertools import cycle

futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    # pair links with drivers, cycling through whichever list is shorter
    zip_list = zip(links, cycle(drivers_instance)) if len(links) > len(drivers_instance) else zip(cycle(links), drivers_instance)
    for par in zip_list:
        futures.append(ex.submit(get_content, par[0], par[1]))

Recommended answer

concurrent.futures can be used here.

from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://pypi.python.org/pypi/{}"

li = ["pywp/1.3", "augploy/0.3.5"]

def get_content(url):
    # each call starts its own Firefox instance
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # find_element_by_tag_name was removed in Selenium 4.3
        tag = driver.find_element(By.TAG_NAME, "a")
        # do your work here and return the result
        return tag.get_attribute("href")
    finally:
        driver.quit()  # close the browser even if the page fails


li = [URL.format(link) for link in li]


futures = []
with ThreadPoolExecutor(max_workers=2) as ex:
    for link in li:
        futures.append(ex.submit(get_content, link))

for future in futures:
    print(future.result())
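
If a page fails to load, future.result() re-raises the exception from the worker thread. Below is a minimal sketch of collecting results as they finish and recording failures per URL. The fetch function is a hypothetical stand-in for get_content (it is not part of the original answer) so that the sketch runs without a browser.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    # Hypothetical stand-in for get_content(url); a real version would drive Selenium.
    if "bad" in url:
        raise ValueError("could not load " + url)
    return "content of " + url


def fetch_all(urls, max_workers=2):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Map each future back to its URL so failures can be reported per link.
        future_to_url = {ex.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors


results, errors = fetch_all(["page_a", "bad_page", "page_b"])
print(sorted(results))
print(sorted(errors))
```

Keeping the future-to-URL mapping plays the role of the try/except around each page in the question: one failing link does not stop the others.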

Keep in mind that two instances of Firefox will start.
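
The code above starts a new Firefox for every URL. One way to reuse a fixed set of browsers instead is to keep the idle drivers in a thread-safe queue, so that each driver is only ever used by one thread at a time (Selenium's WebDriver is not designed to be shared between threads). This is only a sketch: FakeDriver and scrape are hypothetical names introduced here so that the example runs without a browser.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a webdriver instance."""

    def __init__(self, name):
        self.name = name
        self.page_source = ""

    def get(self, url):
        self.page_source = "<html>" + url + "</html>"

    def quit(self):
        pass


def scrape(urls, drivers):
    # Idle drivers wait in a thread-safe queue; a worker borrows one,
    # uses it, and puts it back, so no driver is shared by two threads at once.
    pool = queue.Queue()
    for driver in drivers:
        pool.put(driver)

    def get_content(url):
        driver = pool.get()          # blocks until a driver is free
        try:
            driver.get(url)
            return driver.page_source
        finally:
            pool.put(driver)         # return the driver even if get() failed

    with ThreadPoolExecutor(max_workers=len(drivers)) as ex:
        # map() yields results in the same order as urls.
        return list(ex.map(get_content, urls))


drivers = [FakeDriver("d1"), FakeDriver("d2")]
pages = scrape(["url_1", "url_2", "url_3"], drivers)
for driver in drivers:
    driver.quit()
print(pages)
```

With real Selenium, the FakeDriver objects would be replaced by webdriver.Firefox() instances; the number of drivers then sets how many pages load at the same time.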

Note: you might want to use a headless browser such as PhantomJS instead of Firefox.
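
PhantomJS development has since been suspended, and Firefox itself can run without a window. A minimal sketch, assuming a recent Selenium release with geckodriver available; make_headless_firefox is a hypothetical helper name, not part of the original answer.

```python
def make_headless_firefox():
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")   # Firefox's command-line flag for headless mode
    return webdriver.Firefox(options=options)
```

Calling make_headless_firefox() in place of webdriver.Firefox() in get_content should leave the rest of the code unchanged.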
