What is the practical difference between these two ways of making web connections in Python?
Question
I have noticed there are several ways to initiate HTTP connections for web scraping. I am not sure whether some are more recent, up-to-date ways of coding, or whether they are just different modules with different advantages and disadvantages. More specifically, I am trying to understand the differences between the following two approaches; what would you recommend?
1) Using urllib3:
from urllib3 import PoolManager
from bs4 import BeautifulSoup

http = PoolManager()
r = http.urlopen('GET', url, preload_content=False)
soup = BeautifulSoup(r, "html.parser")
2) Using requests:
import requests
from bs4 import BeautifulSoup

html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")
What sets these two options apart, besides the simple fact that they require importing different modules?
Answer
Under the hood, requests uses urllib3 to do most of the HTTP heavy lifting. When used properly, it should be mostly the same unless you need more advanced configuration.
Except, in your particular example they're not the same:
In the urllib3 example, you're re-using connections whereas in the requests example you're not re-using connections. Here's how you can tell:
>>> import requests
>>> requests.packages.urllib3.add_stderr_logger()
2016-04-29 11:43:42,086 DEBUG Added a stderr logging handler to logger: requests.packages.urllib3
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,043 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,158 DEBUG "GET / HTTP/1.1" 200 None
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,815 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,925 DEBUG "GET / HTTP/1.1" 200 None
To start re-using connections like in a urllib3 PoolManager, you need to make a requests session.
>>> session = requests.session()
>>> session.get('https://www.google.com/')
2016-04-29 11:46:49,649 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:46:49,771 DEBUG "GET / HTTP/1.1" 200 None
>>> session.get('https://www.google.com/')
2016-04-29 11:46:50,548 DEBUG "GET / HTTP/1.1" 200 None
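One way to see that the Session really is backed by the same pooling machinery is to peek at its transport adapter (a quick introspection sketch; `poolmanager` is an attribute of requests' `HTTPAdapter`):

```python
import requests

session = requests.session()

# Each Session mounts HTTPAdapter objects for 'http://' and 'https://',
# and every HTTPAdapter carries a urllib3 PoolManager internally.
adapter = session.get_adapter('https://www.google.com/')
print(type(adapter.poolmanager).__name__)  # → PoolManager
```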
Now it's equivalent to what you were doing with http = PoolManager(). One more note: urllib3 is a lower-level, more explicit library, so you explicitly create a pool and, for example, you'll explicitly need to specify your SSL certificate location. It's an extra line or two of work, but also a fair bit more control if that's what you're looking for.
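You can also watch the pooling directly on the urllib3 side without making a live request: asking a PoolManager for the pool behind a URL twice hands back the same object (a minimal sketch; example.com is just a placeholder host):

```python
import urllib3

http = urllib3.PoolManager()

# Pools are keyed by (scheme, host, port), so repeated requests to the
# same host go through one ConnectionPool and can reuse its sockets.
pool_a = http.connection_from_url('https://example.com/')
pool_b = http.connection_from_url('https://example.com/')
print(pool_a is pool_b)  # → True
```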
All said and done, the comparison becomes:
1) Using urllib3:
import urllib3, certifi
from bs4 import BeautifulSoup

http = urllib3.PoolManager(ca_certs=certifi.where())
html = http.request('GET', url).data
soup = BeautifulSoup(html, "html5lib")
2) Using requests:
import requests
from bs4 import BeautifulSoup

session = requests.session()
html = session.get(url).content
soup = BeautifulSoup(html, "html5lib")
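If you do end up needing the "more advanced configuration" mentioned above, the two libraries compose: urllib3's Retry object can be mounted onto a requests Session through an HTTPAdapter (a sketch; the retry counts and status codes here are arbitrary placeholders):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.session()

# Retry up to 3 times on connection problems or 502/503 responses,
# backing off between attempts.  (Placeholder values.)
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503])
adapter = HTTPAdapter(max_retries=retries)
session.mount('https://', adapter)
session.mount('http://', adapter)
```

Every request made through this session now inherits the retry policy, on top of the connection reuse a Session already gives you.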