Maximum recursion depth exceeded. Multiprocessing and bs4


Question

I'm trying to build a parser using BeautifulSoup and multiprocessing, and I get an error:

RecursionError: maximum recursion depth exceeded

My code is:

import bs4, requests, time
from multiprocessing.pool import Pool

html = requests.get('https://www.avito.ru/moskva/avtomobili/bmw/x6?sgtd=5&radius=0')
soup = bs4.BeautifulSoup(html.text, "html.parser")

divList = soup.find_all("div", {'class': 'item_table-header'})


def new_check():
    with Pool() as pool:
        pool.map(get_info, divList)

def get_info(each):
    pass

if __name__ == '__main__':
    new_check()

Why do I get this error, and how can I fix it?

Update: the full error text is:

Traceback (most recent call last):
  File "C:/Users/eugen/PycharmProjects/avito/main.py", line 73, in <module> new_check()
  File "C:/Users/eugen/PycharmProjects/avito/main.py", line 67, in new_check
    pool.map(get_info, divList)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 424, in _handle_tasks
    put(task)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
RecursionError: maximum recursion depth exceeded

Answer

When you use multiprocessing, everything you pass to a worker has to be pickled.

Unfortunately, many BeautifulSoup trees can't be pickled.

There are a few different reasons for this. Some of them are bugs that have since been fixed, so you could try making sure you have the latest bs4 version, and some are specific to particular parsers or tree builders… but there's a good chance nothing like this will help.

But the fundamental problem is that many elements in the tree contain references to the rest of the tree.

Occasionally, this leads to an actual infinite loop, because the circular references are too indirect for pickle's circular-reference detection. But that's usually a bug that gets fixed.

Even more importantly, though, even when the loop isn't infinite, pickling one element can still drag in more than 1000 elements from all over the rest of the tree, and that's already enough to cause a RecursionError.
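The effect is easy to reproduce with plain pickle and no bs4 at all: any linked structure deeper than the interpreter's recursion limit blows up the pickler, because it recurses once per level. A minimal sketch:

```python
import pickle
import sys

# Build a linked structure deeper than the default recursion limit
# (usually 1000): each list holds a reference to the next level down.
# Construction is iterative, so building it is fine; pickling is not.
node = None
for _ in range(sys.getrecursionlimit() * 2):
    node = [node]

try:
    pickle.dumps(node)
    raised = False
except RecursionError:
    raised = True

print(raised)  # True: the pickler recursed once per nesting level
```

A BeautifulSoup element whose references fan out across a large tree behaves the same way: the pickler keeps following links until it runs out of stack.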

And I think the latter is what's happening here. If I take your code and try to pickle divList[0], it fails. (If I bump the recursion limit way up and count the frames, it needs a depth of 23080, which is way, way past the default of 1000.) But if I take that exact same div and parse it separately, it pickles with no problem.

So, one possibility is to just do sys.setrecursionlimit(25000). That will solve the problem for this exact page, but a slightly different page might need even more than that. (Plus, it's usually not a great idea to set the recursion limit that high—not so much because of the wasted memory, but because it means actual infinite recursion takes 25x as long, and 25x as much wasted resources, to detect.)

Another trick is to write code that "prunes the tree", eliminating any upward links from the div before/as you pickle it. This is a great solution, except that it might be a lot of work, and requires diving into the internals of how BeautifulSoup works, which I doubt you want to do.
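As a sketch of the idea, using a toy Node class rather than bs4's actual internals: temporarily cut the upward link, pickle just the subtree, then restore the link, so the pickle no longer drags in the whole parent tree.

```python
import pickle

class Node:
    """Toy stand-in for a bs4 tag: children link back up via .parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

root = Node("root")
div = Node("div", parent=root)
root.children.append(div)

# Prune the upward link, pickle only the subtree, then restore the link.
saved_parent, div.parent = div.parent, None
data = pickle.dumps(div)
div.parent = saved_parent

clone = pickle.loads(data)
print(clone.parent)  # None: the parent tree was not serialized
```

Doing this for real bs4 tags means handling not just .parent but also the next_element/previous_element sibling chains, which is why it's more work than it looks.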

The easiest workaround is a bit clunky, but… you can convert the soup to a string, pass that to the child, and have the child re-parse it:

import bs4

def new_check():
    divTexts = [str(div) for div in divList]
    with Pool() as pool:
        pool.map(get_info, divTexts)

def get_info(each):
    # Each worker re-parses its own string into a fresh, small tree.
    div = bs4.BeautifulSoup(each, 'html.parser')

if __name__ == '__main__':
    new_check()

The performance cost of doing this probably won't matter; the bigger worry is that if you have imperfect HTML, converting to a string and re-parsing it might not be a perfect round trip. So, I'd suggest you do some tests without multiprocessing first to make sure this doesn't affect the results.
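The reason this workaround is safe from the pickling side: a plain string always pickles in one step, with no recursion involved, so str(div) can always cross the process boundary. Only the re-parse fidelity needs checking:

```python
import pickle

# A string serializes as a single flat object, no matter what HTML it
# contains, so it never triggers the deep-tree recursion problem.
fragment = '<div class="item_table-header"><a href="/x6">BMW X6</a></div>'
restored = pickle.loads(pickle.dumps(fragment))
print(restored == fragment)  # True: strings round-trip exactly
```

Note the class name and href here are made up for illustration; the round trip that actually needs testing is parse → str → parse on your real pages.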
