Dataflow Pipeline Slow


Problem description



My Dataflow pipeline is running extremely slowly. It is processing approximately 4 elements/s with 30 worker threads. A single local machine running the same operations (but not in the Dataflow framework) can process 7 elements/s. The script is written in Python, and the data is read from BigQuery.

The workers are n1-standard, and all appear to be at 100% CPU utilization.

The operations contained within the combine are:

1. tokenize the record and apply stop word filtering (nltk)
2. stem each word (nltk)
3. look up the word in a dictionary
4. increment the count of that word in the dictionary
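For readers who want to picture the four steps, here is a minimal sketch in plain Python. It is not the asker's code: the regex tokenizer, the hard-coded `STOP_WORDS` set, and the `naive_stem` helper are hypothetical stand-ins for the nltk tokenizer, stop word list, and stemmer mentioned in the list, chosen only so the example runs without extra downloads.

```python
import re
from collections import Counter

# Assumption: a tiny hard-coded stop word set standing in for nltk's stop word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def tokenize(record):
    """Lowercase the record and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", record.lower())


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer such as nltk's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_words(record, counts):
    """Apply the four steps to one record, updating `counts` in place."""
    for token in tokenize(record):   # step 1: tokenize
        if token in STOP_WORDS:      # step 1: stop word filtering
            continue
        stem = naive_stem(token)     # step 2: stem
        counts[stem] += 1            # steps 3 and 4: look up and increment
    return counts


counts = count_words("The workers are processing the records", Counter())
print(counts)
```

A `Counter` handles the lookup and the increment in one statement. With millions of records, the number of distinct keys it holds depends on the vocabulary of the data.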

Each record is approximately 30-70 KB. The total number of records from BigQuery is ~9,000,000 (the log shows that all records were exported successfully).

With 30 worker threads, I expect the throughput to be a lot higher than that of my single local machine, and certainly not roughly half as fast.
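To put the numbers in perspective, the following back-of-the-envelope calculation estimates how long the full job would take, taking the quoted rates as 4 elements/s for the pipeline and 7 elements/s for the local machine. Only the record count and the two rates come from the question. The function name is illustrative.

```python
TOTAL_RECORDS = 9_000_000   # approximate record count from the question
SECONDS_PER_DAY = 86_400


def days_to_finish(elements_per_second):
    """Days needed to process every record at a constant rate."""
    return TOTAL_RECORDS / elements_per_second / SECONDS_PER_DAY


dataflow_days = days_to_finish(4)   # rate observed with 30 workers
local_days = days_to_finish(7)      # rate observed on one local machine
per_worker_rate = 4 / 30            # average rate each worker contributes

print(f"Dataflow:   {dataflow_days:.1f} days")
print(f"Local:      {local_days:.1f} days")
print(f"Per worker: {per_worker_rate:.2f} elements/s")
```

At these rates the pipeline would need roughly 26 days against roughly 15 days locally, and each worker would be contributing only about 0.13 elements/s.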

What could the problem be?

Solution

After some performance profiling and testing with datasets of several sizes, it appears that the problem is probably the huge size of the dictionaries. That is, the pipeline works fine for thousands of records (with throughput closer to 40 elements/s) but breaks down on millions. I'm closing this topic, as the rabbit hole goes deeper.
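The thread does not say how the dictionaries are combined, so the following is only a hypothetical illustration of one way large dictionaries can become expensive. It assumes each worker keeps a word-count dictionary as its accumulator and that accumulators are merged key by key. `merge_accumulators` and `merge_cost` are names invented for this sketch.

```python
from collections import Counter


def merge_accumulators(accumulators):
    """Merge per-worker word-count dictionaries into a single one.

    Every key of every accumulator is visited, so the work grows with the
    number of distinct words held across the accumulators.
    """
    merged = Counter()
    for acc in accumulators:
        merged.update(acc)   # adds counts key by key
    return merged


def merge_cost(accumulators):
    """Number of dictionary entries touched by one merge (a rough cost proxy)."""
    return sum(len(acc) for acc in accumulators)


small = [Counter({"worker": 2, "record": 1}), Counter({"record": 3})]
merged = merge_accumulators(small)
print(merged, merge_cost(small))
```

Under this assumption, a merge over a few thousand records touches few entries, while a merge over millions of records may touch a very large vocabulary each time. That would be consistent with the behavior described, though it is not confirmed by the thread.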

Since this is a problem with a specific use case, I thought it would not be relevant to continue on this thread. If you want to follow me on my adventure, the follow-up questions reside here.
