在python中使用多线程时如何获得更快的速度 [英] How to get a faster speed when using multi-threading in python
问题描述
现在,我正在研究如何尽快从网站获取数据.为了获得更快的速度,即时通讯正在考虑使用多线程.这是我用来测试多线程和简单文章之间差异的代码.
import threading
import time
import urllib
import urllib2
class Post:
def __init__(self, website, data, mode):
self.website = website
self.data = data
#mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
self.mode = mode
def post(self):
#post data
req = urllib2.Request(self.website)
open_url = urllib2.urlopen(req, self.data)
if self.mode == "Multiple":
time.sleep(0.001)
#read HTMLData
HTMLData = open_url.read()
print "OK"
if __name__ == "__main__":
current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
"Simple")
#save the time before post data
origin_time = time.time()
if(current_post.mode == "Multiple"):
#multithreading POST
for i in range(0, 10):
thread = threading.Thread(target = current_post.post)
thread.start()
thread.join()
#calculate the time interval
time_interval = time.time() - origin_time
print time_interval
if(current_post.mode == "Simple"):
#simple POST
for i in range(0, 10):
current_post.post()
#calculate the time interval
time_interval = time.time() - origin_time
print time_interval
正如您所看到的,这是一个非常简单的代码.首先,我将模式设置为简单",然后得到时间间隔: 50s (也许我的速度有点慢:().然后我将模式设置为多",然后我得到时间间隔: 35 .从我可以看到,多线程实际上可以提高速度,但是结果却不如我想像的那样.我想获得更快的速度.>
从调试中,我发现该程序主要在以下行阻塞:open_url = urllib2.urlopen(req, self.data)
,此行代码需要大量时间来发布和接收来自指定网站的数据.我想也许我可以通过添加time.sleep()
并在urlopen
函数中使用多线程来获得更快的速度,但是我不能这样做,因为它是python自己的函数.
如果不考虑服务器阻止发布速度的可能限制,我还可以做些什么来获得更快的速度?或我可以修改的任何其他代码?非常感谢!
在许多情况下,python的线程不能很好地提高执行速度...有时,它会使情况变得更糟.有关更多信息,请参见 David Beazley关于全球口译员锁的PyCon2010演示/ Pycon2010 GIL幻灯片.该演示非常有用,我强烈建议所有考虑穿线的人使用它.
即使David Beazley的演讲解释了网络流量可以改善Python线程模块的调度,但您仍应使用 我同意TokenMacGuy的评论,上面的数字包括将.join()
移到另一个循环.如您所见,python的多处理比线程处理快得多.
from multiprocessing import Process
import threading
import time
import urllib
import urllib2
class Post:
def __init__(self, website, data, mode):
self.website = website
self.data = data
#mode is either:
# "Simple" (Simple POST)
# "Multiple" (Multi-thread POST)
# "Process" (Multiprocessing)
self.mode = mode
self.run_job()
def post(self):
#post data
req = urllib2.Request(self.website)
open_url = urllib2.urlopen(req, self.data)
if self.mode == "Multiple":
time.sleep(0.001)
#read HTMLData
HTMLData = open_url.read()
#print "OK"
def run_job(self):
"""This was refactored from the OP's code"""
origin_time = time.time()
if(self.mode == "Multiple"):
#multithreading POST
threads = list()
for i in range(0, 10):
thread = threading.Thread(target = self.post)
thread.start()
threads.append(thread)
for thread in threads:
thread.join()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
if(self.mode == "Process"):
#multiprocessing POST
processes = list()
for i in range(0, 10):
process = Process(target=self.post)
process.start()
processes.append(process)
for process in processes:
process.join()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
if(self.mode == "Simple"):
#simple POST
for i in range(0, 10):
self.post()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
return time_interval
if __name__ == "__main__":
for method in ["Process", "Multiple", "Simple"]:
Post("http://forum.xda-developers.com/login.php",
"vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
method
)
Now i am studying how to fetch data from website as fast as possible. To get faster speed, im considering using multi-thread. Here is the code i used to test the difference between multi-threaded and simple post.
import threading
import time
import urllib
import urllib2
class Post:
def __init__(self, website, data, mode):
self.website = website
self.data = data
#mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
self.mode = mode
def post(self):
#post data
req = urllib2.Request(self.website)
open_url = urllib2.urlopen(req, self.data)
if self.mode == "Multiple":
time.sleep(0.001)
#read HTMLData
HTMLData = open_url.read()
print "OK"
if __name__ == "__main__":
current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
"Simple")
#save the time before post data
origin_time = time.time()
if(current_post.mode == "Multiple"):
#multithreading POST
for i in range(0, 10):
thread = threading.Thread(target = current_post.post)
thread.start()
thread.join()
#calculate the time interval
time_interval = time.time() - origin_time
print time_interval
if(current_post.mode == "Simple"):
#simple POST
for i in range(0, 10):
current_post.post()
#calculate the time interval
time_interval = time.time() - origin_time
print time_interval
just as you can see, this is a very simple code. first i set the mode to "Simple", and i can get the time interval: 50s(maybe my speed is a little slow :(). then i set the mode to "Multiple", and i get the time interval: 35. from that i can see, multi-thread can actually increase the speed, but the result isnt as good as i imagine. i want to get a much faster speed.
from debugging, i found that the program mainly blocks at the line: open_url = urllib2.urlopen(req, self.data)
, this line of code takes a lot of time to post and receive data from the specified website. i guess maybe i can get a faster speed by adding time.sleep()
and using multi-threading inside the urlopen
function, but i cannot do that because its the python's own function.
if not considering the prossible limits that the server blocks the post speed, what else can i do to get the faster speed? or any other code i can modify? thx a lot!
In many cases, python's threading doesn't improve execution speed very well... sometimes, it makes it worse. For more information, see David Beazley's PyCon2010 presentation on the Global Interpreter Lock / Pycon2010 GIL slides. This presentation is very informative, I highly recommend it to anyone considering threading...
Even though David Beazley's talk explains that network traffic improves the scheduling of Python threading module, you should use the multiprocessing module. I included this as an option in your code (see bottom of my answer).
Running this on one of my older machines (Python 2.6.6):
current_post.mode == "Process" (multiprocessing) --> 0.2609 seconds
current_post.mode == "Multiple" (threading) --> 0.3947 seconds
current_post.mode == "Simple" (serial execution) --> 1.650 seconds
I agree with TokenMacGuy's comment and the numbers above include moving the .join()
to a different loop. As you can see, python's multiprocessing is significantly faster than threading.
from multiprocessing import Process
import threading
import time
import urllib
import urllib2
class Post:
def __init__(self, website, data, mode):
self.website = website
self.data = data
#mode is either:
# "Simple" (Simple POST)
# "Multiple" (Multi-thread POST)
# "Process" (Multiprocessing)
self.mode = mode
self.run_job()
def post(self):
#post data
req = urllib2.Request(self.website)
open_url = urllib2.urlopen(req, self.data)
if self.mode == "Multiple":
time.sleep(0.001)
#read HTMLData
HTMLData = open_url.read()
#print "OK"
def run_job(self):
"""This was refactored from the OP's code"""
origin_time = time.time()
if(self.mode == "Multiple"):
#multithreading POST
threads = list()
for i in range(0, 10):
thread = threading.Thread(target = self.post)
thread.start()
threads.append(thread)
for thread in threads:
thread.join()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
if(self.mode == "Process"):
#multiprocessing POST
processes = list()
for i in range(0, 10):
process = Process(target=self.post)
process.start()
processes.append(process)
for process in processes:
process.join()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
if(self.mode == "Simple"):
#simple POST
for i in range(0, 10):
self.post()
#calculate the time interval
time_interval = time.time() - origin_time
print "mode - {0}: {1}".format(method, time_interval)
return time_interval
if __name__ == "__main__":
for method in ["Process", "Multiple", "Simple"]:
Post("http://forum.xda-developers.com/login.php",
"vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
method
)
这篇关于在python中使用多线程时如何获得更快的速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!