IncompleteRead using httplib

Question

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.

I have read a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the number of bytes actually retrieved, I have no certainty that such a solution will actually work.

#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead as e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead as e:
    print "IncompleteRead using urllib2", e


# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead as e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to 
# learn what's happening.  Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    content = ''
    while True:
        try:
            byte = response.read(1)
        except IncompleteRead:
            # The server cut the connection; keep what we have so far.
            break
        if not byte:
            break  # normal end of stream
        content += byte
    return content, response.info()


content, info = get_rss_feed(url)

feed = feedparser.parse(content)

As already stated, this isn't a mission-critical problem, just a curiosity: even though I can expect urllib2 to have this problem, I am surprised to hit the same error in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for failure depends on the presence of a 'bozo_exception' key.
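
For what it's worth, feedparser never raises on a transport failure; it records it instead. A minimal sketch of the documented bozo check, reusing the url from the script above:

import feedparser

d = feedparser.parse(url)
# feedparser sets d.bozo when the feed was ill-formed or the download
# broke off part-way; the underlying exception ends up in d.bozo_exception.
if d.bozo:
    print 'feed problem:', d.get('bozo_exception')
else:
    print 'parsed %d entries' % len(d.entries)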

I just want to mention that both wget and curl perform this task flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, apart from my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided the other day to also try this with twill, and got the same httplib error.

P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, while mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.

Answer

At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.

Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:

>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
 'Content-Type: text/xml; charset=utf-8\r\n',
 'Server: Microsoft-IIS/7.5\r\n',
 'X-AspNet-Version: 4.0.30319\r\n',
 'X-Powered-By: ASP.NET\r\n',
 'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
 'Via: 1.1 BC1-ACLD\r\n',
 'Transfer-Encoding: chunked\r\n',
 'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)

So it is reading all 1854 bytes but then thinks there is more to come. Given the Transfer-Encoding: chunked header above, httplib is presumably still waiting for the terminating zero-length chunk that the server never sends before closing the connection. If we explicitly tell it to read only 1854 bytes it works:

>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

Obviously, this is only useful if we always know the exact length ahead of time. Alternatively, we can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:

>>> import httplib
>>> f = urllib2.urlopen(url)
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:

import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead as e:
            # Settle for whatever arrived before the connection closed.
            return e.partial

    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
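
If patching httplib globally feels too heavy-handed, the same trick works as a local helper around a single response. A minimal sketch (read_all is a hypothetical name, and url is the feed URL from above):

import httplib
import urllib2

def read_all(response):
    # Hypothetical helper: drain the response, settling for the partial
    # body if the server closes the connection before the stream is done.
    try:
        return response.read()
    except httplib.IncompleteRead as e:
        return e.partial

content = read_all(urllib2.urlopen(url))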

I applied the patch and then feedparser worked:

>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': ...
 'status': 200,
 'version': 'rss20'}

This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing something wrong or whether httplib is mishandling an edge case.
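
If you wanted to settle who is at fault, one option is to bypass httplib entirely and inspect the raw bytes on the wire. A rough sketch, assuming the server answers plain HTTP/1.1 on port 80: a well-formed chunked body must end with a zero-length chunk (0\r\n\r\n), so its absence would point at the server.

import socket

host = 'hattiesburg.legistar.com'
path = ('/Feed.ashx?M=Calendar&ID=543375'
        '&GUID=83d4a09c-6b40-4300-a04b-f88884048d49'
        '&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)')

s = socket.create_connection((host, 80))
s.sendall('GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' % (path, host))

raw = ''
while True:
    data = s.recv(4096)
    if not data:
        break
    raw += data
s.close()

# A compliant chunked response ends with '0\r\n\r\n'; if that terminator
# is missing, the server truncated the stream and IncompleteRead is the
# honest report from httplib.
print 'terminating chunk present:', raw.endswith('0\r\n\r\n')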
