动态数据生成以对Beam进行基准测试 [英] On-the-fly data generation for benchmarking Beam

查看:78
本文介绍了动态数据生成以对Beam进行基准测试的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的目标是在具有不同窗口查询的流数据用例上对Apache Beam的延迟和吞吐量进行基准测试.

My goal is to benchmark the latency and the throughput of Apache Beam on a streaming data use-case with different window queries.

我想使用实时数据生成器创建自己的数据,以手动控制数据生成速率,并直接从没有pub/sub机制的管道中使用此数据,即,我不想读取来自经纪人等的数据,以避免瓶颈. 有什么方法可以做与我想达到的目标类似的事情吗?还是使用Beam SDK的这种用例有任何源代码? 到目前为止,我找不到起点,现有的代码示例使用pub/sub机制,并且它们假定数据来自某个地方.

I want to create my own data with an on-the-fly data generator to control the data generation rate manually and consume this data directly from a pipeline without a pub/sub mechanism, i.e. I don't want to read the data from a broker, etc. to avoid bottlenecks. Is there a way of doing something similar to what I want to achieve? or is there any source code for such use-case with Beam SDKs? So far I couldn't find a starting point, existing code samples use pub/sub mechanism and they assume data comes from somewhere.

谢谢您的建议.

推荐答案

关于即时数据,一种选择是例如使用GenerateSequence:

With regards to On-the-fly data, one option would be to make use of GenerateSequence for example:

pipeline.apply(GenerateSequence.from(0).withRate(RATE,Duration.millis(1000)))

要创建其他类型的对象,您可以在之后使用ParDo消耗Long并将其制成其他东西:

To create other types of objects you can use a ParDo after to consume the Long and make it into something else:

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.apply(GenerateSequence.from(0).withRate(2, Duration.millis(1000)))
    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
    .apply(FlatMapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(i -> IntStream.range(0,2).mapToObj(k -> KV.of(String.format("Gen Value %s" , i),String.format("FlatMap Value %s ", k))).collect(Collectors.toList())))
    .apply(ParDo.of(new DoFn<KV<String,String>, String>() {
      @ProcessElement
      public void process(@Element KV<String,String> input){
        LOG.info("Value was {}", input);
      }
    }));
p.run();

那应该产生像这样的值:

That should generate values like:

Value was KV{Gen Value 0, FlatMap Value 0 }
Value was KV{Gen Value 0, FlatMap Value 1 }
Value was KV{Gen Value 1, FlatMap Value 0 }
Value was KV{Gen Value 1, FlatMap Value 1 }
Value was KV{Gen Value 2, FlatMap Value 0 }
Value was KV{Gen Value 2, FlatMap Value 1 }

管道性能测试中要记住其他一些事情:

Some other things to keep in mind for your pipelines performance testing:

  • Direct运行器专为单元测试而设计,它可以完成诸如模拟故障之类的出色工作,这有助于捕获在运行生产管道时将看到的问题.但是,它并非旨在帮助性能测试.对于那些类型的集成测试,我建议始终使用主跑者.

  • The Direct runner is designed for unit testing, it does cool things like simulate failures, this helps catch issues which will be seen when running a production pipeline. It is not designed to help with performance testing however. I would recommend always using a main runner for those types of integration tests.

请注意Fusion优化链接到文档,当使用诸如GenerateSequence之类的人工数据源时,下一步可能需要执行GBK才能使工作并行化.有关数据流运行程序的更多信息,请参见:链接到文档

Please be aware of the Fusion optimization Link to Docs, when using a artificial data source like GenerateSequence you may need to do a GBK as the next step to allow the work to be parallelized. For the Dataflow runner more info can be found here: Link to Docs

通常,对于性能测试,我建议测试整个端到端管道.存在与源和接收器(例如水印)的交互,这些交互不会在独立的管道中进行测试.

In general for performance testing, I would recommend testing the whole end to end pipeline. There are interactions with sources and sinks ( for example watermarks ) which will not be tested in a standalone pipeline.

希望有帮助.

这篇关于动态数据生成以对Beam进行基准测试的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆