A problem while using urllib

This article covers a problem encountered while using urllib; the question and answers below may be useful to readers facing the same issue.

Problem Description





Hi,
I was using urllib to grab URLs from the web. Here is the workflow of
my program:

1. Get the base URL and the maximum number of URLs from the user
2. Call the filter to validate the base URL
3. Read the source of the base URL and grab all the URLs from the "href"
attribute of the "a" tags
4. Call the filter to validate every URL grabbed
5. Continue 3-4 until the number of URLs grabbed reaches the limit

In the filter there is a method like this:

--------------------------------------------------
# check whether the url can be connected
def filteredByConnection(self, url):
    assert url

    try:
        webPage = urllib2.urlopen(url)
    except urllib2.URLError:
        self.logGenerator.log("Error: " + url + " <urlopen error timed out>")
        return False
    except urllib2.HTTPError:
        self.logGenerator.log("Error: " + url + " not found")
        return False
    self.logGenerator.log("Connecting " + url + " succeeded")
    webPage.close()
    return True
----------------------------------------------------

But every time, after roughly 70 to 75 URLs have been tested this way,
the program crashes and all the remaining URLs raise urllib2.URLError
until the program exits. I tried many ways to work around it: using
urllib, setting a sleep(1) in the filter (I thought the sheer number of
URLs was crashing the program). But none of them works.
BTW, if I use the URL at which the program crashed as the base URL, the
program still crashes around the 70th-75th URL. How can I solve this
problem? Thanks for your help.

Regards,
Johnny

Solution

Johnny Lee <jo************@hotmail.com> wrote:
...

try:
    webPage = urllib2.urlopen(url)
except urllib2.URLError:
    ...
webPage.close()
return True
----------------------------------------------------

But every time, after roughly 70 to 75 URLs have been tested this way,
the program crashes and all the remaining URLs raise urllib2.URLError
until the program exits. I tried many ways to work around it: using
urllib, setting a sleep(1) in the filter (I thought the sheer number of
URLs was crashing the program). But none of them works.
BTW, if I use the URL at which the program crashed as the base URL, the
program still crashes around the 70th-75th URL. How can I solve this
problem? Thanks for your help.



Sure looks like a resource leak somewhere (probably leaving a file open
until your program hits some wall of maximum simultaneously open files),
but I can't reproduce it here (MacOSX, tried both Python 2.3.5 and
2.4.1). What version of Python are you using, and on what platform?
Maybe a simple Python upgrade might fix your problem...
Alex
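
If the resource leak Alex suspects is the cause, one candidate fix is to close the response on every path, not only on success. The sketch below is not the original poster's code: it reuses the self.logGenerator name from the question and assumes Python 2's urllib2. It also catches HTTPError before URLError, because in urllib2 HTTPError is a subclass of URLError, so the second except clause in the posted method can never be reached.

--------------------------------------------------
import urllib2

# A hedged sketch, not the original filter: close the handle in a finally
# block and catch the more specific HTTPError first.
def filteredByConnection(self, url):
    assert url
    webPage = None
    try:
        try:
            webPage = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            # e.g. 404 Not Found; HTTPError is a subclass of URLError
            self.logGenerator.log("Error: " + url + " HTTP " + str(e.code))
            return False
        except urllib2.URLError, e:
            # DNS failure, refused connection, timeout, ...
            self.logGenerator.log("Error: " + url + " " + str(e.reason))
            return False
        self.logGenerator.log("Connecting " + url + " succeeded")
        return True
    finally:
        # release the socket/file handle whether or not the open succeeded
        if webPage is not None:
            webPage.close()
--------------------------------------------------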



Alex Martelli wrote:

...

Sure looks like a resource leak somewhere (probably leaving a file open
until your program hits some wall of maximum simultaneously open files),
but I can't reproduce it here (MacOSX, tried both Python 2.3.5 and
2.4.1). What version of Python are you using, and on what platform?
Maybe a simple Python upgrade might fix your problem...
Alex



Thanks for the info you provided. I'm using 2.4.1 on Cygwin on WinXP.
If you want to reproduce the problem, I can send the source to you.

This morning I found that this is caused by urllib2. When I use urllib
instead of urllib2, it won't crash any more. But the trouble is that I
want to catch the HTTP 404 error, which is handled by FancyURLopener in
urllib.open(). So I can't catch it.

Regards,
Johnny
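
For what it's worth, there may be a way to keep using urllib and still see the 404: FancyURLopener swallows most HTTP errors and simply returns the error page, but a small subclass can defer to URLopener's default handler, which raises IOError('http error', code, msg, headers) instead. The RaisingOpener name and the URL below are made up for illustration; alternatively, staying with urllib2, catching urllib2.HTTPError and checking e.code == 404 gives the same information.

--------------------------------------------------
import urllib

# A sketch only: turn HTTP errors back into exceptions while using urllib.
class RaisingOpener(urllib.FancyURLopener):
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        # fall back to URLopener's behaviour, which raises IOError
        return urllib.URLopener.http_error_default(
            self, url, fp, errcode, errmsg, headers)

opener = RaisingOpener()
try:
    page = opener.open("http://example.com/some-missing-page")
    page.close()
except IOError, e:
    # e.args looks like ('http error', 404, 'Not Found', <headers>)
    print "HTTP error:", e
--------------------------------------------------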


Johnny Lee wrote:

...


Thanks for the info you provided. I'm using 2.4.1 on Cygwin on WinXP.
If you want to reproduce the problem, I can send the source to you.

This morning I found that this is caused by urllib2. When I use urllib
instead of urllib2, it won't crash any more. But the trouble is that I
want to catch the HTTP 404 error, which is handled by FancyURLopener in
urllib.open(). So I can't catch it.



I'm using exactly that configuration, so if you let me have that source
I could take a look at it for you.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
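
To pull the thread together, here is a rough sketch of the five-step workflow from the original question with every response explicitly closed, which is the discipline the resource-leak theory calls for. The grab_urls and is_valid names and the regex-based href extraction are illustrative only, not the original poster's code, and relative links are not resolved.

--------------------------------------------------
import re
import urllib2

# Crude href extraction; a real crawler would use an HTML parser and
# resolve relative links against the page URL.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def is_valid(url):
    # stand-in for the filter in the question: close the response
    # whenever the open succeeds
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        # HTTPError is a subclass of URLError, so 404s land here too
        return False
    page.close()
    return True

def grab_urls(base_url, max_urls):
    # steps 1-5 from the question: validate the base URL, read each page,
    # pull out href values, validate them, stop at the limit
    if not is_valid(base_url):
        return []
    found, queue = [], [base_url]
    while queue and len(found) < max_urls:
        try:
            page = urllib2.urlopen(queue.pop(0))
        except urllib2.URLError:
            continue
        try:
            links = HREF_RE.findall(page.read())
        finally:
            page.close()
        for link in links:
            if len(found) >= max_urls:
                break
            if link not in found and is_valid(link):
                found.append(link)
                queue.append(link)
    return found
--------------------------------------------------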

