Python: maximum recursion depth exceeded while calling a Python object

Question

I've built a crawler that has to run over about 5M pages (by incrementing the URL ID) and then parse the pages that contain the information I need.

After running an algorithm over 200K URLs and saving the good and bad results, I found that I was wasting a lot of time. I could see that there are a few recurring subtrahends (the differences between consecutive valid IDs) that I can use to find the next valid URL.

You can spot the subtrahends quite quickly (a short example of the first few "good IDs"):

510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201

After crawling about 200K URLs, which gave me only 14K good results, I knew I was wasting my time and needed to optimize. So I ran some statistics and built a function that checks the URLs while increasing the ID by 8, 18, 17, 8 (the most frequent subtrahends), and so on.
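The statistics step is not shown in the question. A minimal sketch of how the gaps could be counted, assuming the 20 "good IDs" listed above are representative (written in Python 3; `good_ids`, `gaps`, and `candidate_offsets` are illustrative names, not from the original script):

```python
from collections import Counter

# The first "good IDs" listed in the question.
good_ids = [
    510000011, 510000029, 510000037, 510000045, 510000052,
    510000060, 510000078, 510000086, 510000094, 510000102,
    510000110, 510000128, 510000136, 510000144, 510000151,
    510000169, 510000177, 510000185, 510000193, 510000201,
]

# Difference between each valid ID and the one before it.
gaps = [later - earlier for earlier, later in zip(good_ids, good_ids[1:])]

# Count how often each gap occurs, most frequent first.
gap_counts = Counter(gaps)
candidate_offsets = [gap for gap, _ in gap_counts.most_common()]

print(gap_counts.most_common())
print(candidate_offsets)
```

On this small sample only the gaps 8, 18, and 7 appear; the other offsets tried in the function below (17, 6, 16) presumably come from the larger 200K-URL run.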

This is the function:

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                checkNextID(ID + 8)
                return 0
            if isValid(ID + 18):
                parseHTML(curRes)
                checkNextID(ID + 18)
                return 0
            if isValid(ID + 7):
                parseHTML(curRes)
                checkNextID(ID + 7)
                return 0
            if isValid(ID + 17):
                parseHTML(curRes)
                checkNextID(ID + 17)
                return 0
            if isValid(ID+6):
                parseHTML(curRes)
                checkNextID(ID + 6)
                return 0
            if isValid(ID + 16):
                parseHTML(curRes)
                checkNextID(ID + 16)
                return 0
            else:
                checkNextID(ID + 1)
                return 0
        except Exception, e:
            print "somethin went wrong: " + str(e)

What it basically does: checkNextID(ID) receives the first ID that I know contains the data, minus 8, so the first iteration matches the first "if isValid" clause (isValid(ID + 8) returns True).

lastResult is a variable that holds the last known URL ID, so we'll run until numOfRuns is

isValid() is a function that takes an ID plus one of the subtrahends and returns True if the URL contains what I need, saving a soup object of the page to a global variable named curRes. It returns False if the URL doesn't contain the data I need.

parseHTML() is a function that takes the soup object (curRes), parses the data I need, saves it to a CSV, and then returns True.

If isValid() returns True, we call parseHTML() and then try the next ID plus the subtrahends (by calling checkNextID(ID + subtrahend)). If none of them returns what I'm looking for, I increase the ID by 1 and check again until I find the next valid URL.

You can see the rest of the code here

After running the code I got about 950 good results, and then suddenly an exception was raised:

"somethin went wrong: maximum recursion depth exceeded while calling a Python object"

I could see in Wireshark that the script got stuck on ID 510009541 (I started my script at 510000003). The script tried fetching the URL with that ID a few times before I noticed the error and stopped it.

I was really excited to see that I got the same results 25x-40x faster than with my old script, and with fewer HTTP requests. It's very precise: I missed only 1 result per 1000 good results, which is fine by me. It's impossible to run 5M times; my old script ran for 30 hours and got 14-15K results, whereas my new script gave me about 960 results in 5-10 minutes.

I read about stack limitations, but there must be a solution for the algorithm I'm trying to implement in Python (I can't go back to my old "algorithm"; it would never finish).
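To make the stack limitation concrete: in checkNextID every ID that gets checked adds one nested call that does not return until the whole crawl is finished, so the call depth grows with the number of IDs visited. CPython caps that depth (the value reported by `sys.getrecursionlimit()`, commonly 1000 by default). The sketch below is a toy illustration in Python 3, not the crawler itself; `walk_recursive` and `walk_iterative` are hypothetical names. In Python 3 the error is `RecursionError`; in Python 2, which the question uses, the same condition is reported as a `RuntimeError`.

```python
import sys


def walk_recursive(current, last):
    """Visit every ID from current up to last, one nested call per ID."""
    if current >= last:
        return current
    return walk_recursive(current + 1, last)


def walk_iterative(current, last):
    """Visit the same IDs with a loop, so the call depth stays constant."""
    while current < last:
        current += 1
    return current


# Ask for far more steps than the interpreter allows nested calls.
steps = sys.getrecursionlimit() * 10

try:
    walk_recursive(0, steps)
    recursion_failed = False
except RecursionError:
    recursion_failed = True

print("recursive version hit the limit:", recursion_failed)
print("iterative version finished:", walk_iterative(0, steps) == steps)
```

Raising the limit with `sys.setrecursionlimit()` only postpones the failure for a crawl of millions of IDs, which is why a loop is the more robust fix.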

Thanks!

Recommended answer

This turns the recursion into a loop:

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes)
                ID = ID + 16
            else:
                ID = ID + 1
        except Exception, e:
            print "somethin went wrong: " + str(e)
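One thing to watch in the loop above: if isValid() or parseHTML() raises before ID is updated, the except branch prints the message and the loop retries the same ID, possibly forever. A Python 3 variant that always advances could look like the sketch below. It rests on assumptions not in the original: `check_ids`, `is_valid`, and `parse_html` are hypothetical names, the two helpers are passed in as parameters instead of using globals, and `is_valid` is assumed to return the parsed page (or `None`) rather than storing it in curRes.

```python
import time

# Offsets tried in order, taken from the if/elif chain above.
OFFSETS = (8, 18, 7, 17, 6, 16)


def check_ids(start_id, last_id, is_valid, parse_html,
              pause_every=10, pause_seconds=3):
    """Walk from start_id towards last_id in a loop; return the valid IDs found.

    Assumption: is_valid(candidate) returns the parsed page, or None
    when the page does not contain the wanted data.
    """
    found = []
    current = start_id
    runs = 0
    while current < last_id:
        runs += 1
        if pause_every and runs % pause_every == 0:
            time.sleep(pause_seconds)  # throttle, as in the original
        step = 1  # fallback when no offset matches or a call fails
        try:
            for offset in OFFSETS:
                page = is_valid(current + offset)
                if page is not None:
                    parse_html(page)
                    found.append(current + offset)
                    step = offset
                    break
        except Exception as exc:
            print("something went wrong at ID %d: %s" % (current, exc))
        current += step  # always advance, even after an error
    return found


# Usage with stand-ins for the network calls (no real requests are made).
valid = {510000011, 510000029, 510000037, 510000045, 510000052}
pages = []
found = check_ids(
    510000003, 510000052,
    is_valid=lambda i: ("page", i) if i in valid else None,
    parse_html=pages.append,
    pause_seconds=0,
)
print(found)
```

Keeping the offsets in a tuple also makes it easy to reorder them or add new ones as the statistics change, without touching the loop.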
