Python: maximum recursion depth exceeded while calling a Python object

Question

I've built a crawler that had to run over about 5M pages (by incrementing the URL ID) and then parse the pages that contain the info I need.

After running an algorithm over the URLs (200K) and saving the good and bad results, I found that I was wasting a lot of time. I could see that a few differences between consecutive valid IDs (I call them subtrahends) keep recurring, and I can use them to check for the next valid URL.

You can see the subtrahends quite quickly (a small example of the first few "good IDs"):

510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201

After crawling about 200K URLs, which gave me only 14K good results, I knew I was wasting my time and needed to optimize, so I ran some statistics and built a function that checks the URLs while increasing the ID by the most frequently recurring subtrahends (8, 18, 17, 8, etc.).
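
The statistics step is not shown in the post. As a rough sketch of how the recurring gaps could be counted, the snippet below uses the sample IDs listed above; the names `good_ids`, `gap_counts`, and `candidate_steps` are made up for illustration and are not from the original script.

```python
from collections import Counter

# The first "good" IDs listed above (sample data from the post).
good_ids = [
    510000011, 510000029, 510000037, 510000045, 510000052,
    510000060, 510000078, 510000086, 510000094, 510000102,
    510000110, 510000128, 510000136, 510000144, 510000151,
    510000169, 510000177, 510000185, 510000193, 510000201,
]


def gap_counts(ids):
    """Count how often each gap between consecutive IDs occurs."""
    ordered = sorted(ids)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))


counts = gap_counts(good_ids)
# Most frequent gaps first: these are the candidate steps to try.
candidate_steps = [gap for gap, _ in counts.most_common()]
print(counts.most_common())
```

On this small sample the gap of 8 dominates, followed by 18 and 7, which matches the order in which the function below tries its steps.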

This is the function:

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                checkNextID(ID + 8)
                return 0
            if isValid(ID + 18):
                parseHTML(curRes)
                checkNextID(ID + 18)
                return 0
            if isValid(ID + 7):
                parseHTML(curRes)
                checkNextID(ID + 7)
                return 0
            if isValid(ID + 17):
                parseHTML(curRes)
                checkNextID(ID + 17)
                return 0
            if isValid(ID+6):
                parseHTML(curRes)
                checkNextID(ID + 6)
                return 0
            if isValid(ID + 16):
                parseHTML(curRes)
                checkNextID(ID + 16)
                return 0
            else:
                checkNextID(ID + 1)
                return 0
        except Exception as e:
            print("somethin went wrong: " + str(e))

Basically, checkNextID(ID) receives the first ID I know contains the data, minus 8, so the first iteration matches the first "if isValid" clause (isValid(ID + 8) returns True).

lastResult is a variable that holds the last known URL ID, so we'll run until numOfRuns is

isValid() is a function that takes an ID plus one of the subtrahends. If the URL contains what I need, it returns True and saves a soup object of the URL to a global variable named 'curRes'; it returns False if the URL doesn't contain the data I need.

parseHTML is a function that takes the soup object (curRes), parses the data I need, saves the data to a CSV, and then returns True.

If isValid() returns True, we call parseHTML() and then try the next ID plus the subtrahends (by calling checkNextID(ID + subtrahend)). If none of them returns what I'm looking for, I increase the ID by 1 and check again until I find the next valid URL.

After running the code I got about 950 good results, and then suddenly an exception was raised:

"somethin went wrong: maximum recursion depth exceeded while calling a Python object"

I could see in Wireshark that the script got stuck on ID 510009541 (I started my script with 510000003). The script tried fetching the URL with that ID a few times before I noticed the error and stopped it.

I was really excited to see that I got the same results 25x-40x faster than with my old script, with fewer HTTP requests. It's very precise: I missed only 1 result per 1000 good results, which is fine by me. It's impossible to run 5M times; my old script ran for 30 hours and got 14-15K results, whereas my new script gave me about 960 results in 5-10 minutes.

I read about stack limitations, but there must be a solution for the algorithm I'm trying to implement in Python (I can't go back to my old "algorithm"; it would never finish).
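
For context on the stack limit: CPython caps the depth of nested calls (1000 by default, readable via `sys.getrecursionlimit()`), and since checkNextID calls itself once per ID it checks, a failure after roughly 950 good results is consistent with hitting that cap. The sketch below reproduces the effect with a toy function; `count_down` is a made-up stand-in, and `RecursionError` assumes Python 3.5 or later (older versions raise a plain `RuntimeError`).

```python
import sys


def count_down(n):
    """Recurse once per step, like checkNextID calling itself per ID."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)


limit = sys.getrecursionlimit()  # 1000 by default in CPython
print("recursion limit:", limit)

try:
    count_down(limit + 100)  # needs more frames than the interpreter allows
    hit_limit = False
except RecursionError as exc:  # subclass of RuntimeError
    hit_limit = True
    print("failed:", exc)
```

Raising the cap with `sys.setrecursionlimit()` only postpones the failure for a crawl over millions of IDs, so an iterative rewrite is usually the more robust fix.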

Thanks!

Recommended answer

This turns the recursion into a loop:

import time

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    # Iterate instead of recursing: ID is advanced in place, so the call
    # stack no longer grows with every ID that is checked.
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes)
                ID = ID + 16
            else:
                ID = ID + 1  # no known step matched; try the next ID
        except Exception as e:
            # ID is not advanced here, so the same ID is retried.
            print("somethin went wrong: " + str(e))
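
The if/elif chain above can also be written as a loop over a tuple of steps. The following is only a sketch of that idea, not the answerer's code: `crawl`, `STEPS`, `is_valid`, and `handle` are hypothetical names, the two callables stand in for the post's isValid() and parseHTML(), and the sleep and exception handling are left out to keep it short.

```python
# Steps to try, in the same order as the if/elif chain above.
STEPS = (8, 18, 7, 17, 6, 16)


def crawl(start_id, last_id, is_valid, handle):
    """Walk IDs iteratively; return the number of valid IDs handled.

    is_valid(id) -> bool and handle(id) are hypothetical callables that
    stand in for the post's isValid() and parseHTML().
    """
    current = start_id
    found = 0
    while current < last_id:
        for step in STEPS:
            if is_valid(current + step):
                handle(current + step)
                current += step
                found += 1
                break
        else:
            # No known step matched: fall back to advancing by 1.
            current += 1
    return found


# Usage with fake data instead of HTTP requests.
valid = {510000011, 510000029, 510000037, 510000045, 510000052}
seen = []
crawl(510000003, 510000052, valid.__contains__, seen.append)
print(seen)
```

Because the loop keeps a single stack frame no matter how many IDs it visits, it can run past the point where the recursive version failed.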
