Dataflow autoscale does not boost performance

Views: 72 Published: 2021/4/7 20:56:53 Tags: google-cloud-platform google-cloud-dataflow apache-beam google-cloud-pubsub


Problem description

I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a third-party API. The pipeline uses THROUGHPUT_BASED autoscaling.

However, when I ran a load test against it, it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, but the same workload just seemed to be spread evenly across the workers, and overall throughput did not increase significantly.

^ Number of unacknowledged messages in Pub/Sub. The peak is when traffic stopped coming in.

^ Bytes sent from each worker. The peak is the initial worker. As more workers were added to the pool, its workload was offloaded to them rather than each worker picking up additional work. CPU utilization looks the same, with peak utilization below 30% for the initial worker.

^ History of workers spawned.

It feels like a limit is being hit somewhere, but I have a hard time seeing what it is. I was pulling fewer than 300 messages per second, and each message is about 1 KB.

Update: I did another round of comparison between a batch job using TextIO and a streaming job using PubsubIO, both on "n1-standard-8" machines with the number of workers fixed at 15. The batch job went up to 450 elements/s, but the streaming job still peaked at 230 elements/s. The limitation seems to come from the source, although I'm not sure what it is.
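To put the two measurements on a per-worker footing, the figures quoted above can be divided by the fixed worker count: roughly 30 elements/s per worker for the batch job versus roughly 15.3 for the streaming job. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration, and the only inputs are the numbers from the update.

```java
public class PerWorkerRate {
    /** Average elements per second handled by each worker. */
    static double perWorker(double elementsPerSecond, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        return elementsPerSecond / workers;
    }

    public static void main(String[] args) {
        int workers = 15;                            // fixed worker count from the update
        double batch = perWorker(450, workers);      // TextIO batch job peak
        double streaming = perWorker(230, workers);  // PubsubIO streaming job peak
        System.out.printf("batch: %.1f elements/s per worker%n", batch);
        System.out.printf("streaming: %.1f elements/s per worker%n", streaming);
        System.out.printf("batch/streaming ratio: %.2f%n", batch / streaming);
    }
}
```

This only averages the reported peaks; it says nothing about how evenly the work was actually distributed across the 15 machines.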

Update 2: Here is a simple code snippet that reproduces the issue. You will need to manually set the number of workers to 1 and 5 and compare the number of elements processed by the pipeline. You will also need a load tester to publish messages to the topic efficiently.

package debug;

import java.io.IOException;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class DebugPipeline {
    @SuppressWarnings("serial")
    // A launchable Java entry point must return void.
    public static void main(String[] args) throws IOException {

        /*******************************************
         * SETUP - Build options.
         ********************************************/

        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setAutoscalingAlgorithm(
                DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
        // Autoscaling will scale between n/15 and n workers, so from 1-15 here
        options.setMaxNumWorkers(15);
        // Default of 250GB is absurdly high and we don't need that much on every worker
        options.setDiskSizeGb(32);
        // Manually configure scaling (i.e. 1 vs 5 for comparison)
        options.setNumWorkers(5);

        // Debug Pipeline
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply(PubsubIO.readStrings()
                    .fromSubscription("your subscription"))
            // This is the transform that I actually care about. In production code, this will
            // send a REST request to some 3rd party endpoint.
            .apply("sleep", ParDo.of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) throws InterruptedException {
                    // Stand-in for the blocking REST call: 500 ms per element.
                    Thread.sleep(500);
                    c.output(c.element());
                }
            }));

        // Submit the job; the returned PipelineResult is not needed here.
        pipeline.run();
    }
}
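
Because the stand-in transform blocks for 500 ms per element, each element being processed concurrently can complete at a rate of at most 2 per second, regardless of CPU headroom. The sketch below works out that ceiling and, conversely, how many elements would have to be in flight at once to sustain a target rate. It is only arithmetic on the snippet's sleep time: the class and method names are hypothetical, and it makes no claim about how many elements Dataflow actually processes in parallel on each worker.

```java
public class SleepBound {
    /** Upper bound on elements/s when every element blocks for latencyMillis. */
    static double maxElementsPerSecond(int inFlight, long latencyMillis) {
        return inFlight * 1000.0 / latencyMillis;
    }

    /** Elements that must be in flight at once to sustain targetPerSecond. */
    static double requiredInFlight(double targetPerSecond, long latencyMillis) {
        return targetPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        long sleepMillis = 500; // Thread.sleep(500) in the "sleep" transform
        System.out.println("ceiling for 1 in-flight element: "
                + maxElementsPerSecond(1, sleepMillis) + " elements/s");
        // 300 messages/s is the upper end of the rate mentioned in the question.
        System.out.println("in-flight elements needed for 300 elements/s: "
                + requiredInFlight(300, sleepMillis));
    }
}
```

Comparing the measured elements/s at 1 and 5 workers against this ceiling gives a rough idea of how much concurrency each run actually achieved.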
