Dataflow Pipeline Slow
Problem description
My Dataflow pipeline is running extremely slowly. It's processing approximately 4 elements/s with 30 worker threads. A single local machine running the same operations (but outside the Dataflow framework) can process 7 elements/s. The script is written in Python, and the data is read from BigQuery.
The workers are n1-standard, and all appear to be at 100% CPU utilization.
The operations contained within the combine are:
- tokenize the record and apply stop-word filtering (nltk)
- stem each word (nltk)
- look up the word in a dictionary
- increment that word's count in the dictionary
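The four steps above can be sketched in plain Python. This is a minimal, self-contained illustration: `STOP_WORDS`, `naive_stem`, and `process_record` are stand-ins invented here for the nltk calls the question describes (`word_tokenize`, `stopwords.words('english')`, `PorterStemmer`), not the author's actual code.

```python
import re
from collections import Counter

# Tiny stand-in for nltk's English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def naive_stem(word):
    # Crude suffix stripping standing in for nltk's PorterStemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_record(text, counts):
    # Tokenize, filter stop words, stem, and count: the four
    # operations listed above, applied to one record.
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        counts[naive_stem(token)] += 1
    return counts

counts = Counter()
process_record("the cats were chasing the laser", counts)
```

Note that `counts` is a single mutable accumulator shared across records, which is the pattern that becomes relevant later when the vocabulary grows large.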
Each record is approximately 30-70 KB. The total number of records from BigQuery is ~9,000,000 (the log shows all records were exported successfully).
With 30 worker threads, I expected the throughput to be much higher than my single local machine's, and certainly not half as fast.
What could the problem be?
After some performance profiling and testing on datasets of several sizes, it appears this is probably a problem with the huge size of the dictionaries. That is, the pipeline works fine for thousands of records (with throughput closer to 40 elements/s) but breaks on millions. I'm closing this topic, as the rabbit hole goes deeper.
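The finding above can be illustrated with a small stdlib sketch (no Beam involved): a Python dict carries per-entry overhead on top of the keys and values themselves, so an accumulator holding millions of distinct words grows far beyond what a thousands-sized test suggests. `dict_overhead` is a hypothetical helper written for this illustration only.

```python
import sys

def dict_overhead(n):
    # Build a dict with n distinct string keys and report the size of
    # the dict structure itself (hash table, not the strings it holds).
    d = {f"word{i}": 1 for i in range(n)}
    return sys.getsizeof(d)

small = dict_overhead(1_000)      # roughly the "works fine" regime
large = dict_overhead(1_000_000)  # roughly the "breaks" regime
```

In Beam, a common way to avoid one giant per-worker dictionary is to emit `(word, 1)` pairs and let a per-key combine (for example `beam.combiners.Count.PerElement()`) do the counting, so no single accumulator has to hold the whole vocabulary; whether that resolves this particular pipeline's slowdown is not established in the question.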
Since this is a problem with a specific use case, I thought it would not be relevant to continue in this thread. If you want to follow me on my adventure, the follow-up question resides here