Datastore queries in Dataflow DoFn slow down pipeline when run in the cloud


Problem description

I am trying to enhance data in a pipeline by querying Datastore in a DoFn step. A field from each object of class CustomClass is used to query Datastore, and the returned values are used to enhance the object.

The code looks like this:

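// Imports assume Dataflow SDK 1.x and a google-cloud-java release that still
// exposes defaultInstance()/service(); adjust package names to your versions.
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.KeyFactory;

public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {

    // One client and key factory per worker JVM, shared by all instances of this DoFn.
    private static Datastore datastore = DatastoreOptions.defaultInstance().service();
    private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");

    @Override
    public void processElement(ProcessContext c) throws Exception {

        CustomClass event = c.element();

        // Blocking lookup: one network round trip to Datastore per element.
        Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));

        String articleName = "";
        try {
            articleName = article.getString("articleName");
        } catch (Exception e) {
            // Entity or property missing: keep the empty default.
        }

        CustomClass enhanced = new CustomClass(event);
        enhanced.setArticleName(articleName);

        c.output(enhanced);
    }
}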

When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?

A picture of the pipeline can be found here (the last step is the enhancing step): pipeline architecture

Solution

After some checking I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is located in the EU zone, the same as the App Engine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when the zone option is not overridden).

The difference in performance is 25- to 30-fold with 15 workers: ~40 elements/s with the workers in the US, compared to ~1200 elements/s with them in the EU.
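
To get a feel for what these throughput numbers imply, the arithmetic can be sketched in a few lines of Java. This is only a back-of-the-envelope model, not a measurement: it assumes each of the 15 workers handles one element at a time and spends essentially all of that time waiting on the blocking datastore.get call, which is a simplification (real workers may run several threads). The class and method names are made up for illustration.

```java
public class LookupThroughput {

    // Implied wall-clock time one worker spends per element, in seconds,
    // ASSUMING each worker processes a single element at a time.
    static double secondsPerElementPerWorker(int workers, double elementsPerSecond) {
        return workers / elementsPerSecond;
    }

    // How many times faster the second throughput is than the first.
    static double speedup(double slowElementsPerSecond, double fastElementsPerSecond) {
        return fastElementsPerSecond / slowElementsPerSecond;
    }

    public static void main(String[] args) {
        int workers = 15;           // worker count reported in the answer
        double crossRegion = 40.0;  // elements/s, slower figure from the answer
        double sameRegion = 1200.0; // elements/s, faster figure from the answer

        System.out.printf("slow: %.1f ms per element per worker%n",
                1000 * secondsPerElementPerWorker(workers, crossRegion));
        System.out.printf("fast: %.1f ms per element per worker%n",
                1000 * secondsPerElementPerWorker(workers, sameRegion));
        System.out.printf("speedup: %.0fx%n", speedup(crossRegion, sameRegion));
    }
}
```

Under that assumption, the slower figure works out to roughly 375 ms per element per worker and the faster one to about 12.5 ms, which is why a per-element blocking lookup is so sensitive to where the workers run relative to the Datastore.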
