Multiprocessing so slow


Problem description

I have a function that does the following:

  • Takes a file as input and does some basic cleaning.
  • Extracts the required items from the file and writes them into a pandas dataframe.
  • The dataframe is finally converted to CSV and written into a folder.

Here is the sample code:

import os
import multiprocessing


def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching, pattern matching and extracting
        # a dataframe with 10 columns is created and the extracted
        # values are filled into the empty dataframe
        # finally df.to_csv() writes the result into the output folder
        pass
    except Exception:
        # the except clause was not shown in the original snippet
        pass


if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    input_dir = "/home/Desktop/input"
    # build full paths so the worker processes can open the files directly
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()

The total number of files in the input folder is 4000. I used multiprocessing because running the program normally with a for loop was taking some time. Below are the execution times of both approaches:

Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds
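
For reference, here is a minimal sketch of how the two timings above could be measured, assuming the extract_function from the snippet earlier is defined in the same module; the helper name time_runs and the use of time.time() are illustrative, not the author's actual harness:

import os
import time
import multiprocessing

def time_runs(input_dir="/home/Desktop/input"):
    # assumes extract_function (defined in the snippet above) is available here
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]

    # sequential baseline: a plain for loop over every file
    start = time.time()
    for name in filenames:
        extract_function(name)
    print("Normal CPU processing =", round(time.time() - start, 2), "seconds")

    # parallel version: one worker process per CPU core
    start = time.time()
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        pool.map(extract_function, filenames)
    print("Multiprocessing =", round(time.time() - start, 2), "seconds")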

My system specs are:

Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04

While running the program for the 4000 files, all the cores were fully used (averaging around 90% each). So I decided to increase the input size and repeat the process. This time the number of input files was increased from 4,000 to 120,000. But this time, while running the code, the CPU usage was erratic at the start and after some time the utilization went down (average usage around 10% per core). The RAM utilization was also low, peaking at around 4 GB (the remaining 8 GB stayed free). With the 4,000 files as input, the CSV writing was fast: I could see a jump of around 1,000 files or more in an instant. But with the 120,000 files as input, the file writing slowed to about 300 files at a time, the slowdown progressed roughly linearly, and after some time only about 50-70 files were being written at a time. All this while, most of the RAM was free. I restarted the machine and tried the same thing to clear any unwanted zombie processes, but the result was still the same.

What is the reason for this? And how can I achieve the same multiprocessing performance for the larger set of files?

Note:
* Each file is around 300 KB on average.
* Each output file written is around 200 bytes.
* The total number of files is 4080, hence the total size is ~1.2 GB.
* The same 4080 files were copied to get the 120,000 files.
* This program is an experiment to check multiprocessing with a large number of files.

Update 1

I have tried the same code on a much more powerful machine.

Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.

The file writing was much faster than on the normal HDD. The program took:

  • For 4,000 files - 3.7 seconds
  • For 120,000 files - 2 minutes

At some point during the experiment I got the fastest completion time, which was 84 seconds. At that point it gave me consistent results on two consecutive tries. Thinking that it might be because I had correctly set the thread factor in the pool size, I restarted and tried again, but this time it was much slower. To give some perspective, during normal runs around 3,000-4,000 files would be written within a second or two, but this time it was writing below 600 files in a second. In this case too the RAM was hardly being used, and even though the multiprocessing module was in use, all the cores averaged only around 3-7% utilization.
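
A minimal sketch of how such a pool-size "thread factor" could be swept, again assuming extract_function from the question is available; the candidate factors and the sweep_pool_sizes name are illustrative:

import time
import multiprocessing

def sweep_pool_sizes(filenames, factors=(1, 2, 4)):
    # try pool sizes of cpu_count() * factor and report the wall time for each
    for factor in factors:
        size = multiprocessing.cpu_count() * factor
        start = time.time()
        with multiprocessing.Pool(size) as pool:
            pool.map(extract_function, filenames)
        print("pool size", size, "->", round(time.time() - start, 2), "seconds")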

Recommended answer

As @RolandSmith and @selbie suggested, I avoided the continuous I/O of writing into CSV files by replacing it with dataframes and appending to them, which I think cleared up the inconsistencies. I also checked the "feather" and "parquet" high-performance I/O modules suggested by @CoMartel, but I think they are meant for compressing large files into a smaller dataframe structure, and the appending options were not there.
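
A minimal sketch of the collect-then-write pattern described above, assuming each worker returns its extracted rows as a small dataframe instead of calling df.to_csv() per file; the extract_rows helper, the column names and the combined_output.csv path are illustrative:

import os
import multiprocessing

import pandas as pd

def extract_rows(filename):
    # hypothetical worker: parse one file and return its rows as a dataframe
    # instead of writing one small CSV per file
    with open(filename, 'r') as f:
        input_data = f.readlines()
    # ... the original pattern matching / extraction would go here ...
    return pd.DataFrame(columns=["col%d" % i for i in range(10)])

if __name__ == '__main__':
    input_dir = "/home/Desktop/input"
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        frames = pool.map(extract_rows, filenames)
    # append all partial results into one dataframe and write a single CSV at the end
    pd.concat(frames, ignore_index=True).to_csv("combined_output.csv", index=False)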

  • The program runs slowly on the first run; successive runs are faster. This behaviour was consistent.
  • I checked for trailing python processes still running after the program completed, but couldn't find any. So there is some kind of caching within the CPU/RAM that makes the program execute faster on successive runs.

The program for 4,000 input files took 72 seconds on the first execution and then an average of 14-15 seconds for all successive runs after that.

  • Restarting the system clears those caches and causes the program to run slower on the first run.

    The average fresh-run time is 72 seconds. However, killing the program as soon as it starts and then running it again took 40 seconds for the first run after termination, and an average of 14 seconds for all successive runs after that.

    During a fresh run, all-core utilization is around 10-13%, but after the successive runs the core utilization reaches 100%.

    I checked with the 120,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to be used as a server, a dry run should be made so the CPU/RAM caches are warm before it starts accepting API queries, for faster results.
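
A minimal sketch of the dry-run warm-up described above, assuming the first-run slowness comes from the input files not yet being in the operating system's file cache; the warm_up helper is hypothetical:

import os

def warm_up(input_dir="/home/Desktop/input"):
    # hypothetical warm-up pass: read every input file once so the OS file
    # cache is hot before the first real run or the first API query
    for name in os.listdir(input_dir):
        with open(os.path.join(input_dir, name), 'rb') as f:
            f.read()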
