Connection pool is full, discarding connection with ThreadPoolExecutor and multiple headless browsers through Selenium and Python

Problem description


I'm writing some automation software using selenium==3.141.0, python 3.6.7, chromedriver 2.44.

Most of the logic can be executed by a single browser instance, but for some parts I have to launch 10-20 instances to get a decent execution speed.

Once it comes to the part which is executed by ThreadPoolExecutor, browser interactions start throwing this error:

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

browser setup:

def init_chromedriver(cls):
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument(f"user-agent={Utils.get_random_browser_agent()}")
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_options.add_experimental_option("prefs", prefs)

        driver = webdriver.Chrome(driver_paths['chrome'],
                                       chrome_options=chrome_options,
                                       service_args=['--verbose', f'--log-path={bundle_dir}/selenium/chromedriver.log'])
        driver.implicitly_wait(10)

        return driver
    except Exception as e:
        logger.error(e)

relevant code:

ProfileParser instantiates a webdriver and executes a few page interactions. I suppose the interactions themselves are not relevant because everything works without ThreadPoolExecutor. However, in short:

class ProfileParser(object):
    def __init__(self, acc):
        self.driver = Utils.init_chromedriver()

    def __exit__(self, exc_type, exc_val, exc_tb):
        Utils.shutdown_chromedriver(self.driver)
        self.driver = None

    def collect_user_info(self, post_url):
        self.driver.get(post_url)
        profile_url = self.driver.find_element_by_xpath('xpath_here').get_attribute('href')

When this runs in ThreadPoolExecutor, the error above appears at self.driver.find_element_by_xpath or at self.driver.get.

This works:

with ProfileParser(acc) as pparser:
    pparser.collect_user_info(posts[0])

These options do not work (connectionpool errors):

futures = []
# one worker, one future
with ThreadPoolExecutor(max_workers=1) as executor:
    with ProfileParser(acc) as pparser:
        futures.append(executor.submit(pparser.collect_user_info, posts[0]))

# 10 workers, multiple futures
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, p))

UPDATE:

I found a temporary workaround (which does not invalidate the initial question): instantiating the webdriver outside of the ProfileParser class. I don't know why this works while the initial version does not. I suppose the cause lies in some language specifics? Thanks for the answers, but the problem does not seem to be the ThreadPoolExecutor max_workers limit: as you can see, in one of the options I submitted a single instance and it still didn't work.

current workaround:

futures = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        driver = Utils.init_chromedriver()
        futures.append({
            'future': executor.submit(collect_user_info, driver, acc, p),
            'driver': driver
        })

for f in futures:
    f['future'].done()
    Utils.shutdown_chromedriver(f['driver'])
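One possible reading of why the workaround behaves differently, offered as an assumption rather than a confirmed diagnosis: in the failing variants the inner with block exits as soon as submit() returns, so the driver may already be shut down when the worker thread starts using it. In the workaround the drivers are shut down only after the executor's with block has exited, and that exit waits for all submitted tasks. The sketch below reproduces the two lifetimes with a hypothetical FakeDriver and Parser (neither is Selenium code or the asker's class).

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a webdriver; it only tracks whether it is open."""

    def __init__(self):
        self.open = True

    def get(self, url):
        if not self.open:
            raise RuntimeError("driver already shut down")
        return f"loaded {url}"

    def quit(self):
        self.open = False


class Parser:
    """Hypothetical context manager that owns one driver, similar to ProfileParser."""

    def __enter__(self):
        self.driver = FakeDriver()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()

    def collect(self, url):
        time.sleep(0.2)  # simulate work that starts after submit() has returned
        return self.driver.get(url)


def collect_in_worker(url):
    # The worker opens and closes its own driver, so the driver lives as long as the task.
    with Parser() as parser:
        return parser.collect(url)


def run_submit_inside_with(url):
    with ThreadPoolExecutor(max_workers=1) as executor:
        with Parser() as parser:
            future = executor.submit(parser.collect, url)
        # The inner with block has already closed the driver at this point.
        return future.exception()


def run_driver_inside_task(url):
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(collect_in_worker, url).result()
```

Here run_submit_inside_with returns the exception raised inside the task, while run_driver_inside_task returns the page result. Whether the same ordering explains the urllib3 warnings in the real setup would need to be checked against the chromedriver log.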

Solution

This error message...

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

...seems to be an issue in urllib3's connection pooling, which raised these warnings while executing the _put_conn(self, conn) method in connectionpool.py.

def _put_conn(self, conn):
    """
    Put a connection back into the pool.

    :param conn:
        Connection object for the current host and port as returned by
        :meth:`._new_conn` or :meth:`._get_conn`.

    If the pool is already full, the connection is closed and discarded
    because we exceeded maxsize. If connections are discarded frequently,
    then maxsize should be increased.

    If the pool is closed, then the connection will be closed and discarded.
    """
    try:
        self.pool.put(conn, block=False)
        return  # Everything is dandy, done.
    except AttributeError:
        # self.pool is None.
        pass
    except queue.Full:
        # This should never happen if self.block == True
        log.warning(
            "Connection pool is full, discarding connection: %s",
            self.host)

    # Connection never got put back into the pool, close it.
    if conn:
        conn.close()
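To make the queue.Full branch above concrete, here is a minimal sketch of a bounded pool built on the standard library's queue.LifoQueue. TinyPool and simulate are hypothetical names and a simplified model, not urllib3's actual class. The pool is pre-filled with None placeholders, and a connection is discarded whenever more connections are returned than the pool has slots.

```python
import queue


class TinyPool:
    """Simplified model of a fixed-size connection pool (not urllib3's real class)."""

    def __init__(self, maxsize):
        self.pool = queue.LifoQueue(maxsize)
        # Pre-fill with None placeholders; a None slot means "create a connection on demand".
        for _ in range(maxsize):
            self.pool.put(None)
        self.discarded = 0

    def get_conn(self):
        try:
            conn = self.pool.get(block=False)
        except queue.Empty:
            conn = None  # pool exhausted: hand out an extra connection beyond maxsize
        return conn if conn is not None else object()

    def put_conn(self, conn):
        try:
            self.pool.put(conn, block=False)
        except queue.Full:
            # Corresponds to the branch that logs "Connection pool is full" in _put_conn.
            self.discarded += 1


def simulate(maxsize, concurrent):
    """Check out `concurrent` connections at once, return them all, count the discards."""
    pool = TinyPool(maxsize)
    in_use = [pool.get_conn() for _ in range(concurrent)]
    for conn in in_use:
        pool.put_conn(conn)
    return pool.discarded
```

In this model simulate(10, 10) discards nothing, while simulate(10, 12) discards two connections, which is the situation the warning reports.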


ThreadPoolExecutor

ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. Deadlocks can occur when the callable associated with a Future waits on the results of another Future.

class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())

  • An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.
  • initializer is an optional callable that is called at the start of each worker thread; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenThreadPool, as well as any attempt to submit more jobs to the pool.
  • From version 3.5 onwards: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
  • From version 3.6 onwards: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.
  • From version 3.7: Added the initializer and initargs arguments.
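As a small illustration of the parameters listed above, the following sketch submits a few tasks to a pool with an explicit max_workers and thread_name_prefix and collects the results. The function names and the "scraper" prefix are placeholders chosen for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def worker_name(task_id):
    # Report which pool thread handled the task.
    return task_id, threading.current_thread().name


def run_tasks(n_tasks, max_workers):
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="scraper") as executor:
        futures = [executor.submit(worker_name, i) for i in range(n_tasks)]
        # result() blocks until the task has finished; done() only reports status.
        results = [f.result() for f in as_completed(futures)]
    return sorted(results)
```

Calling run_tasks(5, 2) runs five tasks on at most two threads, and every thread name starts with the given prefix, which makes the workers easier to spot in logs.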

As per your question, since you are trying to launch 10-20 instances, the default connection pool size of 10, which is hardcoded in adapters.py, seems not to be enough in your case.

Moreover, @EdLeafe mentions in the discussion "Getting error: Connection pool is full, discarding connection":

It looks like within the requests code, None objects are normal. If _get_conn() gets None from the pool, it simply creates a new connection. It seems odd, though, that it should start with all those None objects, and that _put_conn() isn't smart enough to replace None with the connection.

However, the merge "Add pool size parameter to client constructor" has fixed this issue.

Solution

Increasing the default connection pool size of 10, which was earlier hardcoded in adapters.py and is now configurable, will solve your issue.
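For code that talks HTTP through a requests Session, the pool size can be raised by mounting an HTTPAdapter with larger pool_connections and pool_maxsize values. This is a sketch under that assumption: the answer does not establish that Selenium 3.141.0 routes its traffic through requests, and make_session and the value 20 are illustrative only.

```python
import requests
from requests.adapters import HTTPAdapter


def make_session(pool_size=20):
    """Return a Session whose adapters keep up to pool_size connections per host."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    # Mount the adapter for both schemes so every request uses the larger pool.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A pool_maxsize at least as large as the number of worker threads sharing the session is a reasonable starting point.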


Update

As per your comment update, "...submit a single instance and the outcome is the same...", @meferguson84 wrote the following in the discussion "Getting error: Connection pool is full, discarding connection":

I stepped into the code to the point where it mounts the adapter just to play with the pool size and see if it made a difference. What I found was that the queue is full of NoneType objects with the actual upload connection being the last item in the list. The list is 10 items long (which makes sense). What doesn't make sense is that the unfinished_tasks parameter for the pool is 11. How can this be when the queue itself is only 11 items? Also, is it normal for the queue to be full of NoneType objects with the connection we are using being the last item on the list?

That sounds like a possible cause in your use case as well. It may sound redundant, but you may still perform a couple of ad-hoc steps as follows:
