Slow Performance with Python Dask bag?

Problem Description

I'm trying out some tests of dask.bag to prepare for a big text processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that dask is running about 5 to 6 times slower than a straight single-threaded text processing function.

Can someone explain where I'll see the speed benefits of running dask over a large amount of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance over straight single-threaded text processing?

Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:

http://www.daviddlewis.com/resources/testcollections/reuters21578/

This data isn't exactly the same as the data I'm working against. In my other case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:

import dask.bag as db
from collections import Counter
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    # Baseline: read each file sequentially and tally words in a single Counter.
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

start = datetime.datetime.now()
# Dask version: read_text creates one partition (one task) per matched file.
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())
print(str(datetime.datetime.now() - start))

Here are my results:

[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077


Answer

Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text function creates one task per filename. I suspect that you're simply being swamped by that overhead.
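
As a quick sanity check (my addition, not part of the original answer), you can see how many tasks read_text will generate by looking at the bag's npartitions; with ~150,000 files that is one partition, and therefore one task, per file:

import dask.bag as db

b = db.read_text("./reuters/*.ascii")  # same glob as in the question
# One partition per matched file by default, so roughly one task per file,
# and ~1 ms of scheduler overhead per task adds up quickly.
print(b.npartitions)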

The solution here is probably to process several files in one task. The read_text function doesn't give you any option to do this, but you could switch to dask.delayed, which provides a bit more flexibility, and then convert back to a dask.bag later if preferred.
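
Here is a minimal sketch of that approach (mine, not the answerer's code), assuming the same Reuters glob as the question; the read_many helper and the batch size of 1,000 files per task are arbitrary illustrative choices. The idea is to group the filenames into chunks, wrap a plain-Python reader in dask.delayed so each task handles a whole chunk, and rebuild a bag with db.from_delayed:

import glob
import dask.bag as db
from dask import delayed

my_files = sorted(glob.glob("./reuters/*.ascii"))

def read_many(filenames):
    # One task reads a whole batch of files and returns all their words.
    words = []
    for fn in filenames:
        with open(fn, "r") as f:
            words.extend(f.read().split())
    return words

# 1,000 files per task is an arbitrary choice; tune it so each task
# does enough work to dwarf the ~1 ms per-task overhead.
chunk = 1000
batches = [my_files[i:i + chunk] for i in range(0, len(my_files), chunk)]

b = db.from_delayed([delayed(read_many)(batch) for batch in batches])
wordcount = b.frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())

Each delayed call becomes a single task producing one bag partition, so the task count drops from one per file to one per batch, which is exactly the overhead the answer is pointing at.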
