How can I make my program use multiple cores of my system in Python?

Question

I want to run my program on all the cores that I have. Below is the code I use (it is part of my full program; I have managed to write out the working flow).

def ssmake(data):
    sslist=[]
    for cols in data.columns:
        sslist.append(cols)
    return sslist

def scorecal(slisted):
    subspaceScoresList=[]
    if __name__ == '__main__':
        pool = mp.Pool(4)
        feature,FinalsubSpaceScore = pool.map(performDBScan, ssList)
        subspaceScoresList.append([feature, FinalsubSpaceScore])

        #for feature in ssList:
            #FinalsubSpaceScore = performDBScan(feature)
            #subspaceScoresList.append([feature,FinalsubSpaceScore])
        return subspaceScoresList

def performDBScan(subspace):
    minpoi=2
    Epsj=2
    final_data = df[subspace]
    db = DBSCAN(eps=Epsj, min_samples=minpoi, metric='euclidean').fit(final_data)
    labels = db.labels_
    FScore = calculateSScore(labels)
    return subspace, FScore

def calculateSScore(cluresult):
    score = random.randint(1,21)*5
    return score

def StartingFunction(prvscore,curscore,fe_select,df):
    while prvscore<=curscore:
        featurelist=ssmake(df)
        scorelist=scorecal(featurelist)

a = {'a' : [1,2,3,1,2,3], 'b' : [5,6,7,4,6,5], 'c' : ['dog', 'cat', 'tree','slow','fast','hurry']}
df2 = pd.DataFrame(a)
previous=0
current=0
dim=[]
StartingFunction(previous,current,dim,df2)

I had a for loop in the scorecal(slisted) method, now commented out, that takes each column, performs DBSCAN on it, and calculates a score for that column from the result (in this example I use a random score instead). This loop makes my code run for a long time. So I tried to parallelize over the columns of the DataFrame, running DBSCAN on the cores of my system, and wrote the code as shown above, but it does not give the result I need. I am new to the multiprocessing library and am not sure where '__main__' should be placed in my program. I would also like to know whether there is any other way to run code in parallel in Python. Any help is appreciated.

Recommended answer

Your code has everything needed to run on more than one core of a multi-core processor, but it is a mess. I don't know what problem you are trying to solve with it, and I cannot run it because I don't know what DBSCAN is. To fix your code, you should take several steps.

Function scorecal():

def scorecal(feature_list):
    pool = mp.Pool(4)
    result = pool.map(performDBScan, feature_list)
    return result

result is a list containing all the results returned by performDBScan(). You don't have to populate the list manually.
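
Because pool.map returns one item per input, and performDBScan returns a (subspace, score) tuple, the result is a list of tuples rather than two values, which is why unpacking it as `feature, FinalsubSpaceScore = pool.map(...)` does not work. A minimal sketch of the pattern follows. It assumes a hypothetical stand-in worker, `score_feature`, in place of performDBScan, and uses a `with` block so the pool is cleaned up when the work is done:

```python
import multiprocessing as mp


def score_feature(feature):
    """Hypothetical stand-in for performDBScan: returns (feature, score)."""
    return feature, len(feature) * 5


def scorecal(feature_list, processes=None):
    # processes=None lets Pool pick a worker count from the CPU count.
    with mp.Pool(processes) as pool:
        # map() keeps the input order and returns a list of tuples.
        return pool.map(score_feature, feature_list)


if __name__ == '__main__':
    results = scorecal(['a', 'b', 'c'])
    # Unpack each (feature, score) pair while iterating.
    for feature, score in results:
        print(feature, score)
```

The dummy score here is only the name length times 5; a real worker would run the clustering and scoring instead.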

Main body of the program:

# imports

# functions

if __name__ == '__main__':
    # your code after functions' definition where you call StartingFunction()
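
The guard matters most on Windows, or wherever the `spawn` start method is used: each worker process imports the main module afresh, so any unguarded top-level code would run again in every worker. A small sketch of the layout, with a hypothetical `square` worker and `main` function that are not part of the original program:

```python
import multiprocessing as mp


def square(x):
    # Worker functions live at module top level so they can be pickled.
    return x * x


def main():
    with mp.Pool(2) as pool:
        print(pool.map(square, range(5)))


if __name__ == '__main__':
    # Only the parent process enters this block; workers import the
    # module without calling main() again.
    main()
```

Placing the guard inside a function, as in the question's scorecal(), does no harm by itself but does not protect the top-level code that builds the DataFrame and calls StartingFunction().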

I created a very simplified version of your code (a pool of 4 processes handling 8 columns of my data) with dummy for loops (to make the work CPU-bound) and tried it. I got 100% CPU load (I have a 4-core i5 processor), which naturally resulted in roughly 4x faster computation (20 seconds vs. 74 seconds) compared with a single-process implementation using a for loop.

Edit.

The complete code I used to try multiprocessing (I use Anaconda (Spyder) / Python 3.6.5 / Win10):

import multiprocessing as mp
import pandas as pd
import time


def ssmake():
    pass


def score_cal(data):
    if True:
        pool = mp.Pool(4)
        result = pool.map(
            perform_dbscan,
            (data.loc[:, col] for col in data.columns))
    else:
        result = list()
        for col in data.columns:
            result.append(perform_dbscan(data.loc[:, col]))
    return result


def perform_dbscan(data):
    assert isinstance(data, pd.Series)
    for dummy in range(5 * 10 ** 8):
        dummy += 0
    return data.name, 101


def calculate_score():
    pass


def starting_function(data):
    print(score_cal(data))


if __name__ == '__main__':

    data = {
        'a': [1, 2, 3, 1, 2, 3],
        'b': [5, 6, 7, 4, 6, 5],
        'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry'],
        'd': [1, 1, 1, 1, 1, 1]}
    data = pd.DataFrame(data)

    start = time.time()
    starting_function(data)
    print(
        'running time = {:.2f} s'
        .format(time.time() - start))
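
The question also asks whether there are other ways to run work in parallel in Python. One standard-library alternative is `concurrent.futures.ProcessPoolExecutor`. The sketch below assumes a hypothetical CPU-bound worker, `busy_score`, in place of the DBSCAN call, and passes each column's data to the worker explicitly rather than reading it from a global DataFrame:

```python
from concurrent.futures import ProcessPoolExecutor


def busy_score(item):
    """Hypothetical CPU-bound worker: item is a (name, values) pair."""
    name, values = item
    total = 0
    for _ in range(10 ** 5):
        total += len(values)
    return name, total


def score_all(columns, max_workers=None):
    # max_workers=None picks a default based on the processor count.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map() yields results in input order.
        return list(executor.map(busy_score, columns.items()))


if __name__ == '__main__':
    columns = {'a': [1, 2, 3], 'b': [5, 6, 7]}
    print(score_all(columns))
```

Passing the data as an argument avoids relying on a module-level variable, which worker processes started with `spawn` would not see unless it is defined at import time.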
