What is the practical difference between these two ways of making web connections in Python?


Question

I have noticed there are several ways to initiate HTTP connections for web scraping. I am not sure whether some are more recent, up-to-date ways of coding, or whether they are just different modules with different advantages and disadvantages. More specifically, I am trying to understand the differences between the following two approaches, and what you would recommend.

1) Using urllib3:

from urllib3 import PoolManager
from bs4 import BeautifulSoup

http = PoolManager()
r = http.urlopen('GET', url, preload_content=False)  # file-like response
soup = BeautifulSoup(r, "html.parser")

2) Using requests:

import requests
from bs4 import BeautifulSoup

html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")

What sets these two options apart, besides the simple fact that they require importing different modules?

Answer

Under the hood, requests uses urllib3 to do most of the HTTP heavy lifting. When used properly, the two should behave mostly the same, unless you need more advanced configuration.
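
You can check this relationship from the interpreter. A minimal sketch (on modern requests releases, requests.packages aliases the system urllib3, so the identity check holds; older versions bundled their own copy and would print False):

>>> import requests
>>> import urllib3
>>> requests.packages.urllib3 is urllib3
True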

Except that, in your particular example, they're not the same:

In the urllib3 example you're re-using connections, whereas in the requests example you're not. Here's how you can tell:

>>> import requests
>>> requests.packages.urllib3.add_stderr_logger()
2016-04-29 11:43:42,086 DEBUG Added a stderr logging handler to logger: requests.packages.urllib3
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,043 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,158 DEBUG "GET / HTTP/1.1" 200 None
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,815 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,925 DEBUG "GET / HTTP/1.1" 200 None

To start re-using connections like a urllib3 PoolManager does, you need to make a requests session:

>>> session = requests.session()
>>> session.get('https://www.google.com/')
2016-04-29 11:46:49,649 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:46:49,771 DEBUG "GET / HTTP/1.1" 200 None
>>> session.get('https://www.google.com/')
2016-04-29 11:46:50,548 DEBUG "GET / HTTP/1.1" 200 None

Now it's equivalent to what you were doing with http = PoolManager(). One more note: urllib3 is a lower-level, more explicit library, so you create a pool explicitly, and you'll need to specify your SSL certificate location explicitly, for example. It's an extra line or two of work, but also a fair bit more control, if that's what you're looking for.

All said and done, the comparison becomes:

1) Using urllib3:

import urllib3, certifi
from bs4 import BeautifulSoup

http = urllib3.PoolManager(ca_certs=certifi.where())
html = http.request('GET', url).data  # request() preloads; .data is the body
soup = BeautifulSoup(html, "html5lib")

2) Using requests:

import requests
from bs4 import BeautifulSoup

session = requests.session()
html = session.get(url).content
soup = BeautifulSoup(html, "html5lib")
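
One small follow-up, not part of the original answer: a requests Session (like a urllib3 PoolManager) keeps pooled sockets open, so in longer-lived code it's common to scope it with a context manager. A minimal sketch, using example.com as a stand-in URL:

import requests
from bs4 import BeautifulSoup

# The with-block closes the session's pooled connections on exit.
with requests.Session() as session:
    html = session.get('https://www.example.com/').content
    soup = BeautifulSoup(html, "html5lib")
    print(soup.title)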

