在python中使用多线程时如何获得更快的速度 [英] How to get a faster speed when using multi-threading in python

查看:61
本文介绍了在python中使用多线程时如何获得更快的速度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

现在,我正在研究如何尽快从网站获取数据.为了获得更快的速度,即时通讯正在考虑使用多线程.这是我用来测试多线程和简单文章之间差异的代码.

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()



        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
           thread = threading.Thread(target = current_post.post)
           thread.start()
           thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

正如您所看到的,这是一个非常简单的代码.首先,我将模式设置为简单",然后得到时间间隔: 50s (也许我的速度有点慢:().然后我将模式设置为多",然后我得到时间间隔: 35 .从我可以看到,多线程实际上可以提高速度,但是结果却不如我想像的那样.我想获得更快的速度.

从调试中,我发现该程序主要在以下行阻塞:open_url = urllib2.urlopen(req, self.data),此行代码需要大量时间来发布和接收来自指定网站的数据.我想也许我可以通过添加time.sleep()并在urlopen函数中使用多线程来获得更快的速度,但是我不能这样做,因为它是python自己的函数.

如果不考虑服务器阻止发布速度的可能限制,我还可以做些什么来获得更快的速度?或我可以修改的任何其他代码?非常感谢!

解决方案

在许多情况下,python的线程不能很好地提高执行速度...有时,它会使情况变得更糟.有关更多信息,请参见 David Beazley关于全球口译员锁的PyCon2010演示/ Pycon2010 GIL幻灯片.该演示非常有用,我强烈建议所有考虑穿线的人使用它.

即使David Beazley的演讲解释了网络流量可以改善Python线程模块的调度,但您仍应使用

我同意TokenMacGuy的评论,上面的数字包括将.join()移到另一个循环.如您所见,python的多处理比线程处理快得多.


from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
               thread = threading.Thread(target = self.post)
               thread.start()
               threads.append(thread)
            for thread in threads:
               thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
               process = Process(target=self.post)
               process.start()
               processes.append(process)
            for process in processes:
               process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )

Now i am studying how to fetch data from website as fast as possible. To get faster speed, im considering using multi-thread. Here is the code i used to test the difference between multi-threaded and simple post.

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()



        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
           thread = threading.Thread(target = current_post.post)
           thread.start()
           thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

just as you can see, this is a very simple code. first i set the mode to "Simple", and i can get the time interval: 50s(maybe my speed is a little slow :(). then i set the mode to "Multiple", and i get the time interval: 35. from that i can see, multi-thread can actually increase the speed, but the result isnt as good as i imagine. i want to get a much faster speed.

from debugging, i found that the program mainly blocks at the line: open_url = urllib2.urlopen(req, self.data), this line of code takes a lot of time to post and receive data from the specified website. i guess maybe i can get a faster speed by adding time.sleep() and using multi-threading inside the urlopen function, but i cannot do that because its the python's own function.

if not considering the prossible limits that the server blocks the post speed, what else can i do to get the faster speed? or any other code i can modify? thx a lot!

解决方案

In many cases, python's threading doesn't improve execution speed very well... sometimes, it makes it worse. For more information, see David Beazley's PyCon2010 presentation on the Global Interpreter Lock / Pycon2010 GIL slides. This presentation is very informative, I highly recommend it to anyone considering threading...

Even though David Beazley's talk explains that network traffic improves the scheduling of Python threading module, you should use the multiprocessing module. I included this as an option in your code (see bottom of my answer).

Running this on one of my older machines (Python 2.6.6):

current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution) --> 1.650 seconds

I agree with TokenMacGuy's comment and the numbers above include moving the .join() to a different loop. As you can see, python's multiprocessing is significantly faster than threading.


from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
               thread = threading.Thread(target = self.post)
               thread.start()
               threads.append(thread)
            for thread in threads:
               thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
               process = Process(target=self.post)
               process.start()
               processes.append(process)
            for process in processes:
               process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )

这篇关于在python中使用多线程时如何获得更快的速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆