Why does this parallel search and replace not use 100% of the CPU?


Problem description


I have a very long list of tweets (2 million), and I use regexes to search and replace text in these tweets.

I run this using a joblib.Parallel map (joblib is the parallel backend used by scikit-learn).

My problem is that I can see in Windows' Task Manager that my script does not use 100% of each CPU. It doesn't use 100% of the RAM nor the disk. So I don't understand why it won't go faster.

There are probably synchronization delays somewhere, but I can't find what or where.

The code:

# file main.py
import re
from joblib import delayed, Parallel

def make_tweets():
    tweets = load_from_file()  # this is list of strings

    regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
    mydict = {'aa': 'A', 'bb': 'B'}  

    def handler(match):
        return mydict[match[0].replace(' ', '')]

    def replace_in(tweet):
        return re.sub(regex, handler, tweet)

    # -1 means use all cores
    # I have 6 cores that can run 12 threads
    with Parallel(n_jobs=-1) as parallel:
        tweets2 = parallel(delayed(replace_in)(tweet) for tweet in tweets)

    return tweets2

And here's the Task Manager:


Edit: final word

The answer is that the worker processes were slowed down by joblib synchronization: joblib sends the tweets in small chunks (one by one?) to the workers, which makes them wait. Using multiprocessing.Pool.map with a chunksize of len(tweets)/cpu_count() made the workers utilize 100% of the CPU.

Using joblib, the running time was around 12 min. Using multiprocessing, it is 4 min. With multiprocessing, each worker process consumed around 50 MB of memory.
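For reference, here is a minimal sketch of that fix (not the original script): it assumes handler and replace_in are hoisted to module level so multiprocessing can pickle them, and load_from_file stands in for however the tweets are actually loaded, as in the question.

# Hedged sketch of the multiprocessing fix described above
import re
from multiprocessing import Pool, cpu_count

regex = re.compile(r'a *a|b *b')  # stand-in for the real, more complex regex
mydict = {'aa': 'A', 'bb': 'B'}

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

def make_tweets():
    tweets = load_from_file()  # list of strings, as in the question
    # one large chunk per worker keeps coordination overhead low
    chunk = max(1, len(tweets) // cpu_count())
    with Pool() as pool:
        tweets2 = pool.map(replace_in, tweets, chunksize=chunk)
    return tweets2

# on Windows, call make_tweets() from under an `if __name__ == '__main__':` guard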

Solution

After a bit of playing around, I think it's because joblib spends all its time coordinating the parallel running of everything and no time actually doing any useful work; at least that's what I see under OSX and Linux (I don't have any MS Windows machines).

I started by loading packages, pulling in your code, and generating a dummy file:

from random import choice
import re

from multiprocessing import Pool
from joblib import delayed, Parallel

regex = re.compile(r'a *a|b *b')  # of course more complex IRL, with lookbehind/forward
mydict = {'aa': 'A', 'bb': 'B'}  

def handler(match):
    return mydict[match[0].replace(' ', '')]

def replace_in(tweet):
    return re.sub(regex, handler, tweet)

examples = [
    "Regex replace isn't that computationally expensive... I would suggest using Pandas, though, rather than just a plain loop",
    "Hmm I don't use pandas anywhere else, but if it makes it faster, I'll try! Thanks for the suggestion. Regarding the question: expensive or not, if there is no reason for it to use only 19%, it should use 100%"
    "Well, is tweets a generator, or an actual list?",
    "an actual list of strings",
    "That might be causing the main process to have the 419MB of memory, however, that doesn't mean that list will be copied over to the other processes, which only need to work over slices of the list",
    "I think joblib splits the list in roughly equal chunks and sends these chunks to the worker processes.",
    "Maybe, but if you use something like this code, 2 million lines should be done in less than a minute (assuming an SSD, and reasonable memory speeds).",
    "My point is that you don't need the whole file in memory. You could type tweets.txt | python replacer.py > tweets_replaced.txt, and use the OS's native speeds to replace data line-by-line",
    "I will try this",
    "no, this is actually slower. My code takes 12mn using joblib.parallel and for line in f_in: f_out.write(re.sub(..., line)) takes 21mn. Concerning CPU and memory usage: CPU is same (17%) and memory much lower (60Mb) using files. But I want to minimize time spent, not memory usage.",
    "I moved this to chat because StackOverflow suggested it",
    "I don't have experience with joblib. Could you try the same with Pandas? pandas.pydata.org/pandas-docs/…",
]

with open('tweets.txt', 'w') as fd:
    for i in range(2_000_000):
        print(choice(examples), file=fd)

(see if you can guess where I got the lines from!)

As a baseline, I tried the naive solution:

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    for l in fin:
        fout.write(replace_in(l))

This takes 14.0s (wall clock time) on my OSX laptop, and 5.15s on my Linux desktop. Note that changing your definition of replace_in to use regex.sub(handler, tweet) instead of re.sub(regex, handler, tweet) reduces the above to 8.6s on my laptop, but I won't use this change below.
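That variant just calls the compiled pattern's own sub method, which skips the module-level re.sub wrapper on every call; a one-line sketch of the changed definition:

def replace_in(tweet):
    # call the precompiled pattern directly instead of re.sub(regex, ...)
    return regex.sub(handler, tweet)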

I then tried your joblib package:

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Parallel(n_jobs=-1) as parallel:
        for l in parallel(delayed(replace_in)(tweet) for tweet in fin):
            fout.write(l)

which takes 1min 16s on my laptop, and 34.2s on my desktop. CPU utilisation was pretty low as the child/worker tasks were all waiting for the coordinator to send them work most of the time.

I then tried using the multiprocessing package:

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        for l in pool.map(replace_in, fin, chunksize=1024):
            fout.write(l)

which took 5.95s on my laptop and 2.60s on my desktop. I also tried with a chunk size of 8, which took 22.1s and 8.29s respectively. The chunk size lets the pool send large batches of work to its children, so it spends less time coordinating and more time getting useful work done.
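Since the results are written straight out to a file, a hedged variant of the same idea (reusing Pool and replace_in from above; the chunk size here is the same assumption, not re-benchmarked) could use pool.imap so the full result list is never held in memory:

with open('tweets.txt') as fin, open('tweets2.txt', 'w') as fout:
    with Pool() as pool:
        # imap yields results lazily and in order, while chunksize still
        # controls how much work is handed to each child at a time
        for l in pool.imap(replace_in, fin, chunksize=1024):
            fout.write(l)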

I'd therefore hazard a guess that joblib isn't particularly useful for this sort of usage as it doesn't seem to have a notion of chunksize.
