How to speed up bulk importing into Google Cloud Datastore with multiple workers?


Problem Description


I have an apache-beam based Dataflow job that reads from a single text file (stored in Google Cloud Storage) using the vcf source, transforms the text lines into Datastore entities, and writes them to the Datastore sink. The workflow works fine, but the drawbacks I noticed are:

  • The write speed into datastore is at most around 25-30 entities per second.
  • I tried --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (see graph below: the target number of workers was briefly increased to 2 but reduced back to 1 "based on the ability to parallelize the work in the currently running step"). A sketch of passing these settings through the Python SDK's pipeline options follows this list.
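
For reference, a minimal sketch of passing the same worker settings programmatically via the Beam Python SDK's PipelineOptions. The project id and bucket are placeholders, and the snake_case names are the Python SDK's spellings of these worker options:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; the worker flags mirror the settings tried above.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                      # placeholder
    '--temp_location=gs://my-bucket/tmp',        # placeholder
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--num_workers=10',
    '--max_num_workers=100',
])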

I did not use an ancestor path for the keys; all the entities are of the same kind.

The pipeline code looks like the following:

import apache_beam as beam
from apache_beam.io import vcfio
# WriteToDatastore lives under the v1 datastore IO in the Beam SDK used here.
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore


def write_to_datastore(project, user_options, pipeline_options):
  """Creates a pipeline that writes entities to Cloud Datastore."""
  with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
     | 'Create my entity' >> beam.ParDo(ToEntityFn(), user_options.kind)
     | 'Write to datastore' >> WriteToDatastore(project))
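
ToEntityFn is not shown in the question. Purely as a hypothetical sketch of what a flat-key DoFn of a single kind might look like, following the pattern of Beam's datastore_wordcount example (the property names taken from the VCF record are assumptions):

import uuid

import apache_beam as beam
from google.cloud.proto.datastore.v1 import entity_pb2
from googledatastore import helper as datastore_helper


class ToEntityFn(beam.DoFn):
  """Turns one parsed VCF record into a Datastore entity protobuf."""

  def process(self, record, kind):
    entity = entity_pb2.Entity()
    # Flat key: a single path element of the given kind, no ancestor.
    datastore_helper.add_key_path(entity.key, kind, str(uuid.uuid4()))
    # Assumed properties; the real fields depend on what the VCF record carries.
    datastore_helper.add_properties(entity, {
        'reference_name': record.reference_name,
        'start': record.start,
    })
    yield entity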

Because I have millions of rows to write into Datastore, it would take far too long at a rate of 30 entities/sec.

Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way to make the import faster? Am I missing something in the num_workers setup? Thanks!

Solution

I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is a limitation of _VcfSource combined with a constraint of the VCF format: the format has a header section that defines how to interpret the non-header lines, so each worker that reads the source has to process an entire file. When I split the single file into 5 separate files that share the same header, I successfully got up to 5 workers (but not more, probably for the same reason).
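
A rough sketch of that pre-splitting step, using only gzip from the standard library; the file names and shard count are placeholders, and it loads all records into memory, so a streaming variant would be preferable for very large inputs:

import gzip

def split_vcf(src_path, n_shards, out_prefix):
  """Split a gzipped VCF into n_shards files that each repeat the full header."""
  header, records = [], []
  with gzip.open(src_path, 'rt') as f:
    for line in f:
      # VCF header/meta lines all start with '#'.
      (header if line.startswith('#') else records).append(line)

  shard_size = len(records) // n_shards + 1
  for i in range(n_shards):
    chunk = records[i * shard_size:(i + 1) * shard_size]
    if chunk:
      with gzip.open('%s-%05d.vcf.gz' % (out_prefix, i), 'wt') as out:
        out.writelines(header + chunk)

split_vcf('input.vcf.gz', 5, 'split')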

One thing I don't understand is why, if the number of workers that read is limited to 5 (in this case), the write step is also limited to only 5 workers. Anyway, I think I have found an alternative way to trigger multiple workers with the Beam Dataflow runner: use pre-split VCF files, as sketched below. There is also a related approach in the gcp variant transforms project, in which vcfio has been significantly extended; it appears to support multiple workers with a single input VCF file. I wish the changes in that project could be merged into the Beam project too.
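
With the files pre-split as above, pointing the existing read step at a file pattern instead of a single path should let the runner assign roughly one file per worker (the bucket path below is a placeholder):

(p
 | 'Read vcf files' >> vcfio.ReadFromVcf('gs://my-bucket/split-*.vcf.gz')
 | 'Create my entity' >> beam.ParDo(ToEntityFn(), user_options.kind)
 | 'Write to datastore' >> WriteToDatastore(project))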
