Multiprocessing so slow


Problem description

I have a function that does the following:

  • Takes a file as input and does some basic cleaning.
  • Extracts the required items from the file and writes them into a pandas dataframe.
  • The dataframe is finally converted to CSV and written into a folder.

Here is the sample code:

import os
import multiprocessing


def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching, pattern matching and extracting
        # a dataframe with 10 columns is created and the extracted
        # values are filled into the empty dataframe
        # finally df.to_csv() writes the result into the output folder
        pass
    except Exception:
        # the except clause was not shown in the original snippet
        pass


if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    input_dir = "/home/Desktop/input"
    # build full paths so the worker processes can open the files directly
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()

The total number of files in the input folder is 4000. I used multiprocessing because running the program normally with a for loop was taking some time. Below are the execution times of both approaches:

Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds
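
For reference, here is a minimal sketch of how the two timings above could be measured, assuming the extract_function from the snippet earlier is defined in the same module; the helper name time_runs and the use of time.time() are illustrative, not the author's actual harness:

import os
import time
import multiprocessing

def time_runs(input_dir="/home/Desktop/input"):
    # assumes extract_function (defined in the snippet above) is available here
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]

    # sequential baseline: a plain for loop over every file
    start = time.time()
    for name in filenames:
        extract_function(name)
    print("Normal CPU processing =", round(time.time() - start, 2), "seconds")

    # parallel version: one worker process per CPU core
    start = time.time()
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        pool.map(extract_function, filenames)
    print("Multiprocessing =", round(time.time() - start, 2), "seconds")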

My system specs are:

Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04

While running the program for the 4000 files, all the cores were fully used (averaging around 90% each). So I decided to increase the input size and repeat the process. This time the number of input files was increased from 4,000 to 120,000. But this time, while running the code, the CPU usage was erratic at the start and after some time the utilization went down (average usage around 10% per core). The RAM utilization was also low, peaking at around 4 GB (the remaining 8 GB stayed free). With the 4,000 files as input, the CSV writing was fast: I could see a jump of around 1,000 files or more in an instant. But with the 120,000 files as input, the file writing slowed to about 300 files at a time, the slowdown progressed roughly linearly, and after some time only about 50-70 files were being written at a time. All this while, most of the RAM was free. I restarted the machine and tried the same thing to clear any unwanted zombie processes, but the result was still the same.

What is the reason for this? And how can I achieve the same multiprocessing performance for the larger set of files?

Note:
* Each file is around 300 KB on average.
* Each output file written is around 200 bytes.
* The total number of files is 4080, hence the total size is ~1.2 GB.
* The same 4080 files were copied to get the 120,000 files.
* This program is an experiment to check multiprocessing with a large number of files.

Update 1

I have tried the same code on a much more powerful machine.

Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.

The file writing was much faster than on the normal HDD. The program took:

  • For 4,000 files - 3.7 seconds
  • For 120,000 files - 2 minutes

At some point during the experiment I got the fastest completion time, which was 84 seconds. At that point it gave me consistent results on two consecutive tries. Thinking that it might be because I had correctly set the thread factor in the pool size, I restarted and tried again, but this time it was much slower. To give some perspective, during normal runs around 3,000-4,000 files would be written within a second or two, but this time it was writing below 600 files in a second. In this case too the RAM was hardly being used, and even though the multiprocessing module was in use, all the cores averaged only around 3-7% utilization.
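
A minimal sketch of how such a pool-size "thread factor" could be swept, again assuming extract_function from the question is available; the candidate factors and the sweep_pool_sizes name are illustrative:

import time
import multiprocessing

def sweep_pool_sizes(filenames, factors=(1, 2, 4)):
    # try pool sizes of cpu_count() * factor and report the wall time for each
    for factor in factors:
        size = multiprocessing.cpu_count() * factor
        start = time.time()
        with multiprocessing.Pool(size) as pool:
            pool.map(extract_function, filenames)
        print("pool size", size, "->", round(time.time() - start, 2), "seconds")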

Recommended answer

As @RolandSmith and @selbie suggested, I avoided the continuous I/O of writing into CSV files by replacing it with dataframes and appending to them, which I think cleared up the inconsistencies. I also checked the "feather" and "parquet" high-performance I/O modules suggested by @CoMartel, but I think they are meant for compressing large files into a smaller dataframe structure, and the appending options were not there.
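
A minimal sketch of the collect-then-write pattern described above, assuming each worker returns its extracted rows as a small dataframe instead of calling df.to_csv() per file; the extract_rows helper, the column names and the combined_output.csv path are illustrative:

import os
import multiprocessing

import pandas as pd

def extract_rows(filename):
    # hypothetical worker: parse one file and return its rows as a dataframe
    # instead of writing one small CSV per file
    with open(filename, 'r') as f:
        input_data = f.readlines()
    # ... the original pattern matching / extraction would go here ...
    return pd.DataFrame(columns=["col%d" % i for i in range(10)])

if __name__ == '__main__':
    input_dir = "/home/Desktop/input"
    filenames = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        frames = pool.map(extract_rows, filenames)
    # append all partial results into one dataframe and write a single CSV at the end
    pd.concat(frames, ignore_index=True).to_csv("combined_output.csv", index=False)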

  • The program runs slowly on the first run; successive runs are faster. This behaviour was consistent.
  • I checked for trailing python processes still running after the program completed, but couldn't find any. So there is some kind of caching within the CPU/RAM that makes the program execute faster on successive runs.

The program for 4,000 input files took 72 seconds on the first execution and then an average of 14-15 seconds for all successive runs after that.

  • Restarting the system clears those caches and causes the program to run slower on the first run.

    The average fresh-run time is 72 seconds. However, killing the program as soon as it starts and then running it again took 40 seconds for the first run after termination, and an average of 14 seconds for all successive runs after that.

    During a fresh run, all-core utilization is around 10-13%, but after the successive runs the core utilization reaches 100%.

    I checked with the 120,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to be used as a server, a dry run should be made so the CPU/RAM caches are warm before it starts accepting API queries, for faster results.
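
A minimal sketch of the dry-run warm-up described above, assuming the first-run slowness comes from the input files not yet being in the operating system's file cache; the warm_up helper is hypothetical:

import os

def warm_up(input_dir="/home/Desktop/input"):
    # hypothetical warm-up pass: read every input file once so the OS file
    # cache is hot before the first real run or the first API query
    for name in os.listdir(input_dir):
        with open(os.path.join(input_dir, name), 'rb') as f:
            f.read()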
