Python grequests takes a long time to finish

Question

I am trying to unshorten a large number of URLs that I have in urlSet. The following code works most of the time, but sometimes it takes a very long time to finish. For example, with 2950 URLs in urlSet, stderr tells me that 2900 are done, but getUrlMapping does not finish.

import sys
import grequests

def getUrlMapping(urlSet):
    # Build a mapping from each original (short) URL to its final URL.
    urlMapping = {}
    #rs = (grequests.get(u) for u in urlSet)
    rs = (grequests.head(u) for u in urlSet)
    # Send the requests concurrently, at most 100 at a time.
    res = grequests.imap(rs, size=100)
    counter = 0
    for x in res:
        counter += 1
        if counter % 50 == 0:
            # Progress report every 50 responses.
            sys.stderr.write('Doing %d url_mapping length %d\n' % (counter, len(urlMapping)))
        urlMapping[getOriginalUrl(x)] = getGoalUrl(x)
    return urlMapping

def getGoalUrl(resp):
    url = ''
    try:
        url = resp.url
    except:
        url = 'NULL'
    return url

def getOriginalUrl(resp):
    url = ''
    try:
        url = resp.history[0].url
    except IndexError:
        url = resp.url
    except:
        url = 'NULL'
    return url

Recommended answer

This probably won't help you, since a long time has passed, but still...

I was having some issues with Requests similar to the ones you describe. In my case, Requests took ages to download some pages, while every other client (browsers, curl, wget, Python's urllib) worked fine.

After a LOT of wasted time, I noticed that the server was sending some invalid headers. For example, in one of the "slow" pages, after Content-type: text/html it began to send headers in the form Header-name : header-value (notice the space before the colon). This somehow breaks Python's email.header functionality, which Requests uses to parse HTTP headers, so the Transfer-encoding: chunked header wasn't being parsed.
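The effect of such a header can be reproduced without any server. The following is a minimal sketch, assuming the headers are parsed with the standard library's email parser (the same family of code the answer refers to). The header name X-Example and the helper parse_header_block are made up for illustration.

```python
from email.parser import Parser


def parse_header_block(raw_headers):
    """Parse a raw header block with the standard library's email parser."""
    return Parser().parsestr(raw_headers)


# A well-formed header block: every header is recognised.
VALID = (
    "Content-Type: text/html\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
)

# Same block with one hypothetical header that has a space before the colon.
MALFORMED = (
    "Content-Type: text/html\r\n"
    "X-Example : value\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
)

valid_msg = parse_header_block(VALID)
malformed_msg = parse_header_block(MALFORMED)

# Headers before the malformed line survive; the parser stops treating lines
# as headers at the malformed one, so later headers are not recognised.
print("valid:    ", valid_msg["Transfer-Encoding"])
print("malformed:", malformed_msg["Transfer-Encoding"])
```

In this sketch the parser gives up on header parsing at the first line it cannot read as a header, which is consistent with the Transfer-encoding header going missing in the answer's scenario.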

Long story short: manually setting the chunked property of the Response object's raw attribute to True before asking for the content solved the issue. For example:

response = requests.get('http://my-slow-url')
print(response.text)

took ages, but

response = requests.get('http://my-slow-url')
response.raw.chunked = True
print(response.text)

worked fine!
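One caveat: as far as I know, recent versions of Requests read the whole body inside requests.get unless stream=True is passed, so the flag may need to be set on a streamed response to have any effect. The helper below is a sketch under that assumption. The name fetch_text_forcing_chunked and the injected get parameter are hypothetical, and the timeout is an extra safeguard that the original answer does not mention.

```python
def fetch_text_forcing_chunked(url, get, timeout=10):
    """Fetch `url` and decode the body as chunked, whatever the headers say.

    `get` is expected to behave like requests.get; it is passed in
    (a hypothetical injection point) so the helper can be tried with a
    stand-in object as well as with the real library.
    """
    # stream=True asks the client to defer reading the body (assumption about
    # current Requests behaviour), so the flag below is set before any read.
    response = get(url, stream=True, timeout=timeout)
    # The workaround from the answer: force chunked decoding of the raw body.
    response.raw.chunked = True
    return response.text
```

With the real library this would be called as fetch_text_forcing_chunked('http://my-slow-url', requests.get), after import requests.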
