Python, using multiprocess is slower than not using it


Question

After spending a lot of time trying to wrap my head around multiprocessing, I came up with this code, which is a benchmark test:

Example 1:

from multiprocessing import Process

class Alter(Process):
    def __init__(self, word):
        Process.__init__(self)
        self.word = word
        self.word2 = ''

    def run(self):
        # Alter string + test processing speed
        for i in range(80000):
            self.word2 = self.word2 + self.word

if __name__=='__main__':
    # Send a string to be altered
    thread1 = Alter('foo')
    thread2 = Alter('bar')
    thread1.start()
    thread2.start()

    # wait for both to finish

    thread1.join()
    thread2.join()

    print(thread1.word2)
    print(thread2.word2)

This completes in 2 seconds (half the time of multithreading). Out of curiosity, I decided to run this next:

Example 2:

word2 = 'foo'
word3 = 'bar'

word = 'foo'
for i in range(80000):
    word2 = word2 + word

word  = 'bar'
for i in range(80000):
    word3 = word3 + word

print(word2)
print(word3)

To my horror, this ran in less than half a second!

What is going on here? I expected multiprocessing to run faster - shouldn't it complete in half of Example 2's time, given that Example 1 is Example 2 split across two processes?

After considering Chris's feedback, I have included the 'actual' code that consumes the most processing time and led me to consider multiprocessing:

self.ListVar = [[13379+ strings], [13379+ strings],
                [13379+ strings], [13379+ strings]]

for b in range(len(self.ListVar)):
    self.list1 = []
    self.temp = []
    for n in range(len(self.ListVar[b])):
        if not self.ListVar[b][n] in self.temp:
            self.list1.insert(n, self.ListVar[b][n] + '(' +
                              str(self.ListVar[b].count(self.ListVar[b][n])) +
                              ')')
            self.temp.insert(0, self.ListVar[b][n])

    self.ListVar[b] = list(self.list1)

Answer

ETA: Now that you've posted your code, I can tell you there is a simple way to do what you're doing MUCH faster (>100 times faster).

I see that what you're doing is adding a frequency in parentheses to each item in a list of strings. Instead of counting all the elements each time (which, as you can confirm using cProfile, is by far the largest bottleneck in your code), you can just create a dictionary that maps each element to its frequency. That way, you only have to go through the list twice: once to create the frequency dictionary, and once to use it to add the frequency.
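
As an illustration (my sketch, not part of the original answer), here is how cProfile might be used on a scaled-down stand-in for one of your inner lists to confirm where the time goes:

import cProfile
import random
import string

# a scaled-down stand-in for one of the 13379-string lists
words = ["".join(random.choice(string.ascii_uppercase) for _ in range(3))
         for _ in range(5000)]

def old_style():
    seen, out = [], []
    for w in words:
        if w not in seen:
            # words.count(w) rescans the entire list on every call
            out.append(w + '(' + str(words.count(w)) + ')')
            seen.append(w)
    return out

cProfile.run('old_style()', sort='tottime')
# list.count should sit at or near the top of the tottime column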

Here I'll show my new method, time it, and compare it to the old method using a generated test case. The test case even shows the new result to be exactly identical to the old one. Note: all you really need to pay attention to below is new_method.

import random
import time
import collections
import cProfile

LIST_LEN = 14000

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t


def random_string(length=3):
    """Return a random string of given length"""
    return "".join([chr(random.randint(65, 90)) for i in range(length)])


class Profiler:
    def __init__(self):
        self.original = [[random_string() for i in range(LIST_LEN)]
                            for j in range(4)]

    def old_method(self):
        self.ListVar = self.original[:]
        for b in range(len(self.ListVar)):
            self.list1 = []
            self.temp = []
            for n in range(len(self.ListVar[b])):
                if not self.ListVar[b][n] in self.temp:
                    self.list1.insert(n, self.ListVar[b][n] + '(' +
                                      str(self.ListVar[b].count(self.ListVar[b][n])) +
                                      ')')
                    self.temp.insert(0, self.ListVar[b][n])

            self.ListVar[b] = list(self.list1)
        return self.ListVar

    def new_method(self):
        self.ListVar = self.original[:]
        for i, inner_lst in enumerate(self.ListVar):
            freq_dict = collections.defaultdict(int)
            # create frequency dictionary
            for e in inner_lst:
                freq_dict[e] += 1
            temp = set()
            ret = []
            for e in inner_lst:
                if e not in temp:
                    ret.append(e + '(' + str(freq_dict[e]) + ')')
                    temp.add(e)
            self.ListVar[i] = ret
        return self.ListVar

    def time_and_confirm(self):
        """
        Time the old and new methods, and confirm they return the same value
        """
        time_a = time.time()
        l1 = self.old_method()
        time_b = time.time()
        l2 = self.new_method()
        time_c = time.time()

        # confirm that the two are the same
        assert l1 == l2, "The old and new methods don't return the same value"

        return time_b - time_a, time_c - time_b

p = Profiler()
print(p.time_and_confirm())
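
As a side note (my addition, not part of the original answer), collections.Counter builds the same frequency map in a single call and is a drop-in replacement for the manual loop in new_method:

freq_dict = collections.Counter(inner_lst)  # equivalent to the defaultdict loop above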

When I run this, it gets times of (15.963812112808228, 0.05961179733276367), meaning it's about 250 times faster, though this advantage depends on both how long the lists are and the frequency distribution within each list. I'm sure you'll agree that with this speed advantage, you probably won't need to use multiprocessing :)

(My original answer is left below for posterity.)

ETA: By the way, it is worth noting that this algorithm is roughly linear in the length of the lists, while the code you used is quadratic. This means its advantage grows with the number of elements. For example, if you increase the length of each list to 1000000, it takes only 5 seconds to run. Based on extrapolation, the old code would take over a day :)
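
To make the growth rates concrete, here is a small timing sketch (my addition, using simplified stand-ins for both methods rather than the Profiler class above):

import collections
import random
import time

def quadratic(lst):
    # mirrors the original approach: one full list scan per distinct element
    seen, out = [], []
    for e in lst:
        if e not in seen:
            out.append(e + '(' + str(lst.count(e)) + ')')
            seen.append(e)
    return out

def linear(lst):
    # mirrors new_method: one pass to count, one pass to annotate
    freq = collections.Counter(lst)
    seen, out = set(), []
    for e in lst:
        if e not in seen:
            out.append(e + '(' + str(freq[e]) + ')')
            seen.add(e)
    return out

for n in (4000, 8000):
    data = [str(random.randint(0, 10**9)) for _ in range(n)]  # mostly distinct
    for f in (quadratic, linear):
        t = time.time()
        f(data)
        print(f.__name__, n, round(time.time() - t, 3))
# doubling n should roughly quadruple quadratic()'s time but only double linear()'s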

It depends on the operation you are performing. For example:

import time
from multiprocessing import Process

NUM_RANGE = 100000000

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t

# defined at module level so it can be pickled under the spawn start method
class MultiProcess(Process):
    def run(self):
        # CPU-bound busywork: a tight arithmetic loop
        for i in range(NUM_RANGE):
            a = 20 * 20

def multi():
    proc1 = MultiProcess()
    proc2 = MultiProcess()
    proc1.start()
    proc2.start()
    proc1.join()
    proc2.join()

def single():
    for i in range(NUM_RANGE):
        a = 20 * 20

    for i in range(NUM_RANGE):
        a = 20 * 20

if __name__ == '__main__':
    print(timefunc(multi) / timefunc(single))

On my machine, the multiprocessed operation takes only about 60% of the time of the single-threaded one.
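
Conversely (my sketch, not part of the original answer), if the per-process workload is tiny, process startup dominates and the ratio flips well above 1:

import time
from multiprocessing import Process

def tiny():
    # far too little work to amortize the cost of starting a process
    for i in range(1000):
        a = 20 * 20

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t

def multi_tiny():
    procs = [Process(target=tiny) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

def single_tiny():
    tiny()
    tiny()

if __name__ == '__main__':
    print(timefunc(multi_tiny) / timefunc(single_tiny))  # typically much greater than 1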
