How to reuse a selenium driver instance during parallel processing?
Question
To scrape a pool of URLs, I am parallel-processing selenium with joblib. In this context, I am facing two challenges:
- Challenge 1 is to speed up the process. At the moment, my code opens and closes a driver instance for every URL (ideally it would be one per process).
- Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong).
Pseudo code:
URL_list = [URL1, URL2, URL3, ..., URL100000]                     # List of URLs to be scraped

def scrape(URL):
    while True:                                                   # Loop needed to use continue
        try:                                                      # Try scraping
            driver = webdriver.Firefox(executable_path=path)      # Set up driver
            website = driver.get(URL)                             # Get URL
            results = do_something(website)                       # Get results from URL content
            driver.close()                                        # Close worker
            if len(results) == 0:                                 # If do_something() failed:
                continue                                          # THEN worker to skip URL
            else:                                                 # If do_something() worked:
                safe_results("results.csv")                       # THEN save results
                break                                             # Go to next worker/URL
        except Exception as e:                                    # If something weird happens:
            save_exception(URL, e)                                # THEN save error message
            break                                                 # Go to next worker/URL

Parallel(n_jobs=40)(delayed(scrape)(URL) for URL in URL_list)     # Run in 40 processes
My understanding is that in order to reuse a driver instance across iterations, the # Set up driver line needs to be placed outside scrape(URL). However, everything outside scrape(URL) will not find its way to joblib's Parallel(n_jobs=40). This would imply that you can't reuse driver instances while scraping with joblib, which can't be true.
Q1: How can driver instances be reused during parallel processing in the example above?
Q2: How can I get rid of the while loop while keeping the functionality of the example above?
Note: Flash and image loading are disabled in firefox_profile (code not shown)
Answer
1) You should first create a set of drivers: one for each process, and pass an instance to the worker. I don't know how to pass drivers to a Parallel object, but you could use the threading.current_thread().name key to identify drivers. To do that, use backend="threading". So now each thread will have its own driver.
2) You don't need a loop at all. The Parallel object itself iterates over all your URLs (I hope I really understood your intention in using a loop).
import threading
from joblib import Parallel, delayed
from selenium import webdriver

def scrape(URL):
    try:
        # Reuse this thread's driver if it already exists
        driver = drivers[threading.current_thread().name]
    except KeyError:
        # First URL handled by this thread: create its driver once
        drivers[threading.current_thread().name] = webdriver.Firefox()
        driver = drivers[threading.current_thread().name]
    driver.get(URL)
    results = do_something(driver)
    if results:
        safe_results("results.csv")

drivers = {}
Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)

# Clean up: quit every driver after all URLs are processed
for driver in drivers.values():
    driver.quit()
But I don't really think you gain anything from using more jobs than you have CPUs, so n_jobs=-1 is best (of course I may be wrong; try it).