Does urllib2.urlopen() actually fetch the page?

Question

I was wondering: when I use urllib2.urlopen(), does it just read the headers, or does it actually bring back the entire webpage?

I.e., is the HTML page actually fetched on the urlopen() call or on the read() call?

import urllib2

handle = urllib2.urlopen(url)
html = handle.read()

The reason I ask is for this workflow...

  • I have a list of URLs (some of them from URL-shortening services)
  • I only want to read the webpage if I haven't seen that URL before
  • I need to call urlopen() and use geturl() to get the final page the link points to (after any 302 redirects), so I know whether I've already crawled it.
  • I don't want to incur the overhead of grabbing the HTML if I've already parsed that page.

Thanks!

Solution

I just ran a test with Wireshark. When I called urllib2.urlopen('url-for-a-700mbyte-file'), only the headers and a few packets of the body were retrieved immediately. It wasn't until I called read() that the majority of the body came across the network. This matches what I see from reading the source code of the httplib module.

So, to answer the original question, urlopen() does not fetch the whole body over the network. It fetches the headers and usually some of the body. The rest of the body is fetched when you call read().

The partial body fetch is to be expected, because:

  1. Unless you read an HTTP response one byte at a time, there is no way to know exactly how long the incoming headers will be, and therefore no way to know how many bytes to read before the body starts.

  2. An HTTP client has no control over how many bytes a server bundles into each TCP segment of a response.
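
The first point can be illustrated with a small sketch. It is not httplib's actual code: the helper name split_headers_and_body and the sample bytes are made up for illustration. It only shows that when a client reads a buffered chunk rather than single bytes, whatever follows the blank line that ends the headers is already body data sitting in the client's buffer.

```python
def split_headers_and_body(chunk):
    """Split one buffered read into (header_bytes, leftover_body_bytes).

    Hypothetical helper for illustration only; real clients also handle
    headers that span several reads.
    """
    headers, separator, leftover = chunk.partition(b"\r\n\r\n")
    if not separator:
        # Header terminator not seen yet; a real client would read more.
        return None, b""
    return headers, leftover


# A made-up first chunk, as a single socket read might return it.
first_chunk = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)

headers, leftover = split_headers_and_body(first_chunk)
print(leftover)  # body bytes that arrived together with the headers
```

With a body this small, the whole page has arrived before read() is ever called, which is the situation described for small HTML pages.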

In practice, since some of the body is usually fetched along with the headers, you might find that small bodies (e.g. small html pages) are fetched entirely on the urlopen() call.
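
For the workflow in the question, one possible sketch is to check geturl() first and call read() only when the final URL is new. This version assumes Python 3, where urllib2's functionality lives in urllib.request; the names fetch_if_new and seen_urls and the injectable opener parameter are illustrative choices, not part of either library.

```python
import urllib.request


def fetch_if_new(url, seen_urls, opener=urllib.request.urlopen):
    """Return the page body if the final URL is new, otherwise None.

    seen_urls is a set of final URLs that were already crawled.
    opener defaults to urllib.request.urlopen and can be swapped in tests.
    """
    response = opener(url)
    try:
        final_url = response.geturl()  # URL after any redirects
        if final_url in seen_urls:
            # Skip read(): closing here avoids pulling the rest of the body.
            return None
        seen_urls.add(final_url)
        return response.read()
    finally:
        response.close()
```

A call might look like html = fetch_if_new(short_url, seen_urls), where short_url is one entry from the list. As the answer notes, the headers and possibly the first part of the body have already crossed the network by the time geturl() is available, so this saves most of the transfer for large pages and little or nothing for small ones.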
