Python selenium multiprocessing


Problem description

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from a landing page and then get the title of each post by following the URL to its inner page. Although the content I parse here is static, I used Selenium to see how it works with multiprocessing.

However, my intention is to do the scraping using multiprocessing. So far I understood that Selenium doesn't support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time of Selenium when it is run with multiprocessing?

This is my attempt (it's a working one):

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
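  # the landing page is static HTML, so plain requests + BeautifulSoup is enough here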
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  titles = [urljoin(link,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
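  # load the question page in headless Chrome and print its title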
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(chrome_options=chromeOptions)
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))

Recommended answer

how can I reduce the execution time of Selenium when it is run with multiprocessing

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

# (imports and get_links() are the same as in the script above, plus:)
import threading

threadLocal = threading.local()

def get_driver():
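  # create the driver only on this thread's first call, then reuse it for every later URL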
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver


def get_title(url):
  driver = get_driver()
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

# (get_links() and the __main__ block stay the same as in the question's script)
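One practical follow-up on this pattern, not covered in the snippet above: the thread-local drivers are never quit, so headless Chrome processes can linger after the pool has finished. Below is a minimal cleanup sketch, assuming the rest of the script stays as shown; all_drivers is a hypothetical registry that get_driver() fills so every driver can be quit at the end.

all_drivers = []  # hypothetical registry of every driver that gets created

def get_driver():
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
    all_drivers.append(driver)  # remember the driver so it can be quit later
  return driver

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))
  for d in all_drivers:  # shut down every headless Chrome instance at the end
    d.quit()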

On my system this reduces the time from 1m 7s to just 24.895s, cutting it to about 37% of the original run time (roughly a 2.7x speed-up). To test it yourself, download the full script.
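If you want to reproduce the comparison on your own machine, one minimal way to measure it (my sketch, not taken from the original answer) is to wrap the map call with time.perf_counter():

import time

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  start = time.perf_counter()
  ThreadPool(5).map(get_title,get_links(url))
  print("elapsed: %.3fs" % (time.perf_counter() - start))  # compare both versions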

Note: ThreadPool uses threads, which are constrained by the Python GIL. That is fine as long as the task is mostly I/O-bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead; it launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same.
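A minimal sketch of that swap, assuming the same get_links() and get_title() as above (this variant is my illustration, not code from the original answer); get_driver() keeps working unchanged because each worker process simply holds its own thread-local driver:

from multiprocessing import Pool

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  with Pool(5) as pool:  # 5 worker processes instead of 5 threads
    pool.map(get_title, get_links(url))  # same worker functions as before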

