Dataflow autoscale does not boost performance

Question

I'm building a Dataflow pipeline that reads from pubsub and sends requests to a 3rd party API. The pipeline uses THROUGHPUT_BASED autoscaling.

However, when I load tested it, after it autoscaled to 4 workers to catch up with the backlog in pubsub, the same workload was just spread evenly across the workers; overall throughput did not increase significantly.

^ Number of unacknowledged messages in pubsub. The peak is when traffic stopped going in.

^ Bytes sent from each worker. The peak is the initial worker. As more workers were added to the pool, the workload was offloaded to them instead of each worker picking up more. CPU utilization looks the same, with peak utilization below 30% on the initial worker.

^ History of workers spawned.

It feels like a limit is being hit somewhere, but I have a hard time seeing what it is. I was pulling fewer than 300 messages per second, and each message is about 1 KB.

Update: I did another round of comparison between a batch job using TextIO and a streaming job using PubSubIO, both with "n1-standard-8" machines and a fixed number of 15 workers. The batch job went up to 450 elements/s, but the streaming job still peaked at 230 elements/s. The limit seems to come from the source, although I'm not sure what it is.

Update 2: Here is a simple code snippet to reproduce the issue. You will need to manually set the number of workers to 1 and then to 5, and compare the number of elements processed by the pipeline. You will need a load tester to publish messages to the topic efficiently.

package debug;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class DebugPipeline {
    @SuppressWarnings("serial")
    public static void main(String[] args) {

        /*******************************************
         * SETUP - Build options.
         ********************************************/

        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setAutoscalingAlgorithm(
                DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
        // Autoscaling will scale between n/15 and n workers, so from 1-15 here
        options.setMaxNumWorkers(15);
        // Default of 250GB is absurdly high and we don't need that much on every worker
        options.setDiskSizeGb(32);
        // Manually configure scaling (i.e. 1 vs 5 for comparison)
        options.setNumWorkers(5);

        // Debug Pipeline
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply(PubsubIO.readStrings()
                    .fromSubscription("your subscription"))
            // this is the transform that I actually care about. In production code, this will
            // send a REST request to some 3rd party endpoint.
            .apply("sleep", ParDo.of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws InterruptedException {
                    Thread.sleep(500);
                    c.output(c.element());
                }
            }));

        pipeline.run();
    }
}
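
Side note: with Thread.sleep(500) standing in for the REST call, each processing thread can emit at most 2 elements/s, so overall throughput is bounded by the total number of processing threads rather than by CPU, which is consistent with the low CPU utilization described above. One common workaround, sketched below on the assumption that the 3rd party calls can be issued concurrently (this is not from the original post, and the batch size, key fan-out, and pool size of 16 are arbitrary), is to batch elements and fan the blocking calls out to a thread pool:

// Drop-in replacement for the "sleep" ParDo above. Additional imports needed:
// java.util.ArrayList, java.util.List, java.util.concurrent.ExecutorService,
// java.util.concurrent.Executors, java.util.concurrent.Future,
// org.apache.beam.sdk.transforms.GroupIntoBatches, org.apache.beam.sdk.transforms.WithKeys,
// org.apache.beam.sdk.values.KV, org.apache.beam.sdk.values.TypeDescriptors.
.apply(WithKeys.of((String s) -> s.hashCode() % 16)
        .withKeyType(TypeDescriptors.integers()))
// Stateful transform; collects up to 16 elements per key before emitting.
.apply(GroupIntoBatches.ofSize(16))
.apply("parallel sleep", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
    private transient ExecutorService pool;

    @Setup
    public void setup() {
        pool = Executors.newFixedThreadPool(16);
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        List<Future<String>> futures = new ArrayList<>();
        for (String element : c.element().getValue()) {
            futures.add(pool.submit(() -> {
                Thread.sleep(500); // stand-in for the blocking REST call
                return element;
            }));
        }
        // The processing thread still blocks ~500 ms, but now per batch
        // of up to 16 elements instead of per element.
        for (Future<String> f : futures) {
            c.output(f.get());
        }
    }

    @Teardown
    public void teardown() {
        pool.shutdown();
    }
}));

Note that GroupIntoBatches only fires once a batch fills for a given key, so under a trickle of traffic elements may sit buffered for a while; the sketch illustrates where the thread time goes, it is not production-ready code.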

Answer

Considering that:

  1. Switching from PubSubIO to TextIO didn't show any improvement.
  2. Changing from 3 workers to 15 workers didn't show any improvement either.
  3. The batch job went up to 450 elements/s, but the streaming job peaked at 230 elements/s.
  4. There is a transform that sends REST requests to a 3rd party API, which is time-consuming.
  5. In a test, removing that transform increased throughput from 120 elements/s to 400 elements/s.

The issue doesn't seem to lie on the PubSub side. According to this documentation, you might be overloading the 3rd party API. The same effect is explained in the documentation for client libraries, rather than 3rd party APIs:

It's possible that one client could have a backlog of messages because it doesn't have the capacity to process the volume of incoming messages, but another client on the network does have that capacity. The second client could reduce the overall backlog, but it doesn't get the chance to because the first client cannot send its messages to the second client quickly enough. This reduces the overall rate of processing because messages get stuck on the first client.

The messages that create a backlog consume memory, CPU, and bandwidth resources because the client library continues to extend the messages' acknowledgment deadline.

[...]

More generally, the need for flow control indicates that messages are being published at a higher rate than they are being consumed. If this is a persistent state, rather than a spike in message volume, consider increasing the number of subscriber client instances and machines.
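
For context, the flow control the quoted passage refers to is configured on the subscriber client itself. Below is a minimal sketch with the google-cloud-pubsub Java client, assuming a standalone subscriber outside Dataflow; the project and subscription IDs are placeholders, and the outstanding-message limits are arbitrary examples:

import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class FlowControlledSubscriber {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("your-project-id", "your-subscription-id");

        // Cap how many messages may be outstanding (pulled but not yet acked)
        // so one slow client does not hoard the backlog.
        FlowControlSettings flowControl = FlowControlSettings.newBuilder()
                .setMaxOutstandingElementCount(100L)
                .setMaxOutstandingRequestBytes(10L * 1024L * 1024L) // 10 MB
                .build();

        MessageReceiver receiver = (message, consumer) -> {
            // ... call the 3rd party API here ...
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                .setFlowControlSettings(flowControl)
                .build();
        subscriber.startAsync().awaitRunning();
    }
}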

If you can only work on PubSub to improve the results, and you think the way to achieve this is extending the acknowledgement deadline for elements, you can test it by accessing here and manually editing the subscription. To do it programmatically in Java, have a look at this and this documentation, about managing subscriptions and changing ackDeadlineSeconds respectively.
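
As a hedged sketch of the programmatic route, here is how ackDeadlineSeconds can be extended with the Pub/Sub admin client; the IDs are placeholders and the 60 s deadline is an arbitrary example (the default is 10 s):

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class UpdateAckDeadline {
    public static void main(String[] args) throws Exception {
        try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
            Subscription subscription = Subscription.newBuilder()
                    .setName(ProjectSubscriptionName.format(
                            "your-project-id", "your-subscription-id"))
                    .setAckDeadlineSeconds(60) // extend from the 10 s default
                    .build();
            // Only touch ack_deadline_seconds; leave the rest of the
            // subscription configuration as-is.
            UpdateSubscriptionRequest request = UpdateSubscriptionRequest.newBuilder()
                    .setSubscription(subscription)
                    .setUpdateMask(FieldMask.newBuilder()
                            .addPaths("ack_deadline_seconds")
                            .build())
                    .build();
            client.updateSubscription(request);
        }
    }
}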
