Why is reading multiple files at the same time slower than reading sequentially?


Question

I am trying to parse many files found in a directory; however, using multiprocessing slows my program down.

# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles') <--- 1000 .txt files found here.
                                                       combined ~100MB

Following this example from the Python documentation:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

I wrote this code:

from multiprocessing import Pool
from api.ttypes import *

import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)

I followed the example: I put the names of all the files ending in .txt in a list, created the Pool, and mapped my function over it. I want to return a list of objects, each holding the parsed data of one file. However, it amazes me that I got the following results:

#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2  ---> ~130(s)

Graph: (image omitted)

Machine specs:

62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12   

What am I missing here?
Thanks in advance!

Answer

It looks like you are I/O bound:


In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
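
A quick way to test this is to time the raw disk reads separately from the parsing. The sketch below is illustrative only (the directory path comes from the question; the function name is made up for this example):

import os
import time

def time_read_only(directory):
    # Time pulling the bytes off disk, with no parsing at all.
    start = time.time()
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename)) as f:
                f.read()
    return time.time() - start

print(time_read_only('/home/tony/Lab/slicedFiles'))

If reading alone accounts for most of the total runtime, the task is I/O bound, and extra worker processes mostly contend for the same disk.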

当子进程可用时,您可能需要让主线程执行读取并将数据添加到池中。这与使用 map 不同。

You probably need to have your main thread do the reading and hand data to the pool as a subprocess becomes available. This is different from using map.
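
A minimal sketch of that pattern, assuming the directory from the question (the helper names and the simplified tuple parsing are mine, not from the original post):

from multiprocessing import Pool
import os

def _parse_text(text):
    # CPU-bound work only; the workers never touch the disk.
    return [tuple(int(v) for v in line.split()) for line in text.splitlines()]

def read_files(directory):
    # The parent performs all the (serial) I/O, one file at a time.
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename)) as f:
                yield f.read()

if __name__ == '__main__':
    with Pool(2) as pool:
        results = list(pool.imap(_parse_text, read_files('/home/tony/Lab/slicedFiles')))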

As you process one line at a time and the inputs are already split into many files, you can use fileinput to iterate over the lines of multiple files, and map a function that processes lines instead of files:

Passing one line at a time might be too slow, so we can hand each worker a chunk of lines and adjust the chunk size until we find a sweet spot. Our function parses a chunk of lines:

def _parse_coreset_points(lines):
    # Parse a chunk (list) of lines into a single Points object.
    # CoresetPoint and Points come from the question's api.ttypes import.
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    # Each line holds two integers: "x y".
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)

Our main function:

import fileinput
import os
from itertools import islice
from multiprocessing import Pool

def _chunked(iterable, size):
    # Yield lists of `size` lines; imap's chunksize only batches task
    # dispatch and still calls the function once per item, so chunk explicitly.
    it = iter(iterable)
    return iter(lambda: list(islice(it, size)), [])

def getParsedFiles(directory):
    pool = Pool(2)
    txts = [os.path.join(directory, filename)
            for filename in os.listdir(directory)
            if filename.endswith(".txt")]
    # Feed each worker one chunk of lines at a time; tune the 100 as needed.
    return pool.imap(_parse_coreset_points, _chunked(fileinput.input(txts), 100))
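
getParsedFiles now returns a lazy iterator that yields one Points object per chunk as workers finish, so the caller drains results as they arrive. A hypothetical caller (the path comes from the question; the counting is just for illustration):

if __name__ == '__main__':
    count = 0
    for points in getParsedFiles('/home/tony/Lab/slicedFiles'):
        count += 1  # each item is the Points for one 100-line chunk
    print(count, "chunks parsed")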
