Threading HTTP requests (with proxies)

Question

I've looked at similar questions, but there always seems to be a whole lot of disagreement over the best way to handle threading with HTTP.

What I specifically want to do: I'm using Python 2.7, and I want to try and thread HTTP requests (specifically, POSTing something), with a SOCKS5 proxy for each. The code I have already works, but is rather slow since it's waiting for each request (to the proxy server, then the web server) to finish before starting another. Each thread would most likely be making a different request with a different SOCKS proxy.

So far I've purely been using urllib2. I looked into modules like PycURL, but it is extremely difficult to install properly with Python 2.7 on Windows, which I want to support and which I am coding on. I'd be willing to use any other module though.

I've looked at these questions in particular:

Python urllib2.urlopen() is slow, need a better way to read several urls

Python - Example of urllib2 asynchronous / threaded request using HTTPS

Many of the examples received downvotes and arguments. Assuming the commenters are correct, making a client with an asynchronous framework like Twisted sounds like it would be the fastest thing to use. However, I Googled ferociously, and Twisted does not provide any sort of support for SOCKS5 proxies. I'm currently using the Socksipy module, and I could try something like:

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, IP, port)
socks.wrapmodule(twisted.web.client)

I have no idea if that would work though, and I also don't even know if Twisted is what I really want to use. I could also just go with the threading module and work that into my current urllib2 code, but if that is going to be much slower than Twisted, I may not want to bother. Does anyone have any insight?

Answer

Perhaps an easier way would be to just rely on gevent (or eventlet) to let you open lots of connections to the server. These libs monkeypatch urllib to make it async, whilst still letting you write code that is sync-ish. Their smaller overhead vs threads also means you can spawn lots more (1000s would not be unusual).

I've used something like this loads (plagiarized from here):

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2


def print_head(url):
    print ('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print ('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)  # block until every greenlet has finished
