No benefit from Python Multi-threading in IO task?


Question


I am trying to read several thousand hours of wav files in Python and get their durations. This essentially requires opening each wav file, getting the number of frames, and factoring in the sampling rate. Below is the code for that:

import wave


def wav_duration(file_name):
    # Duration in seconds = number of frames / sampling rate.
    wv = wave.open(file_name, 'r')
    nframes = wv.getnframes()
    samp_rate = wv.getframerate()
    duration = nframes / samp_rate
    wv.close()
    return duration


def build_datum(wav_file):
    # Key is the last three path components, minus the ".wav" extension.
    key = "/".join(wav_file.split('/')[-3:])[:-4]
    try:
        datum = {"wav_file": wav_file,
                 "labels": all_labels[key],  # all_labels is a pre-built dict
                 "duration": wav_duration(wav_file)}

        return datum
    except KeyError:
        return "key_error"
    except:
        return "wav_error"


Doing this sequentially will take too long. My understanding was that multi-threading should help here since it is essentially an IO task. Hence, I do just that:

import concurrent.futures
import time

all_wav_files = all_wav_files[:1000000]
data, key_errors, wav_errors = list(), list(), list()

start = time.time()

# max_workers was varied (1, 2, 10, 100) across the runs reported below
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # submit jobs and get the mapping from futures to wav_file
    future2wav = {executor.submit(build_datum, wav_file): wav_file for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        try:
            datum = future.result()
            if datum == "key_error":
                key_errors.append(wav_file)
            elif datum == "wav_error":
                wav_errors.append(wav_file)
            else:
                data.append(datum)
        except:
            print("Generated exception from thread processing: {}".format(wav_file))

print("Time : {}".format(time.time() - start))


To my dismay, however, I get the following results (in seconds):

Num threads | 100k wavs | 1M wavs
1           | 4.5       | 39.5
2           | 6.8       | 54.77
10          | 9.5       | 64.14
100         | 9.07      | 68.55


Is this expected? Is this a CPU-intensive task? Will multiprocessing help? How can I speed things up? I am reading files from the local drive, and this is running in a Jupyter notebook on Python 3.5.


EDIT: I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. Others' analyses have shown that in IO-bound cases it can be counterproductive to use multiprocessing. Hence I decided to use multi-threading instead.

I guess the question now is: is this task IO bound?


EDIT EDIT: For those wondering, I think it was CPU bound (one core was maxing out at 100%). The lesson here is not to make assumptions about the task: check it for yourself.

Answer


Some things to check by category:

Code

  • How efficient is wave.open? Is it loading the entire file into memory when it could simply be reading header information?
  • Why is max_workers set to 1?
  • Have you tried using cProfile or even timeit to get an idea of which particular part of the code is taking the most time? (See the sketch below.)
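
For instance, a minimal profiling sketch, assuming wav_duration is defined as in the question and some_file is a hypothetical path pointing at one of your wav files:

import cProfile
import timeit

some_file = "/path/to/one.wav"  # hypothetical; substitute one of your files

# Where does a single call spend its time? Sort by cumulative time.
cProfile.run("wav_duration(some_file)", sort="cumtime")

# Average wall time of one call over many repetitions.
print(timeit.timeit(lambda: wav_duration(some_file), number=1000) / 1000)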

Hardware


Re-run your existing setup while monitoring hard disk activity, memory usage, and CPU load to confirm that hardware is not your limiting factor. If you see your hard disk running at maximum IO, your memory filling up, or all CPU cores at 100%, one of those could be at its limit.
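
As a minimal sketch, per-core CPU load can also be sampled from Python itself while the job runs, assuming the third-party psutil package is installed (pip install psutil):

import psutil

# Sample per-core CPU utilisation over one second; a single core pinned
# near 100% while the others idle suggests a GIL-bound workload.
print(psutil.cpu_percent(interval=1, percpu=True))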

Global Interpreter Lock (GIL)


If there are no obvious hardware limitations, you are most likely running into problems with Python's Global Interpreter Lock (GIL), as described well in this answer. This behavior is to be expected if your code is limited to running on a single core, or if there is no effective concurrency between running threads. In this case, I'd most certainly change to multiprocessing, starting with one process per CPU core, running that, and then comparing the hardware monitoring results with the previous run.
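
A minimal sketch of that change, assuming build_datum and all_wav_files are defined as in the question (the chunksize value is an illustrative guess to amortise inter-process overhead, not a tuned figure):

import concurrent.futures
import os

# One worker process per CPU core; each process has its own interpreter
# and its own GIL, so wav-header parsing can run truly in parallel.
# build_datum must be defined at module top level so it can be pickled.
with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
    results = list(executor.map(build_datum, all_wav_files, chunksize=100))

If one core was indeed maxing out, as the question's final edit suggests, this is the variant worth benchmarking against the threaded numbers above.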
