Populating data in Google Cloud Bigtable is taking a long time

Question

I am using the following code to populate data into Bigtable:

CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
                .withConfiguration("clusterId", options.getBigTableClusterId())
                .withProjectId(options.getProject())
                .withInstanceId(options.getBigTableInstanceId())
                .withTableId(options.getOutputBTTable())
                .build();
     Pipeline p = Pipeline.create(options);
     /**
      * Read data from BigQuery
      */
     CloudBigtableIO.initializeForWrite(p);
     p.apply(BigQueryIO.Read.fromQuery(getQuery(options.getDate())))
      .apply(ParDo.of(new DoFn<TableRow, Mutation>() {
          @Override
          public void processElement(ProcessContext c) {
              Mutation output = convertDataToRow(c.element());
              if (output != null) {
                  c.output(output);
              }
          }
      }))
      .apply(CloudBigtableIO.writeToTable(config));
     p.run();

private static Mutation convertDataToRow(TableRow element) {
     // Note: logging every element at INFO level is expensive at ~10 billion rows;
     // consider removing this or sampling it.
     LOG.info("element: " + element);
     if (element.get("BASM_AID") != null) {
         Put obj = new Put(getRowKey(element).getBytes());
         obj.addColumn(USER_FAMILY, AID, ((String) element.get("BASM_AID")).getBytes());
         // Guard the cast: the original added BAS_category without a null check,
         // which throws a NullPointerException whenever that field is absent.
         if (element.get("BAS_category") != null) {
             obj.addColumn(SEGMENT_FAMILY, SEGMENT_COLUMN_NAME, ((String) element.get("BAS_category")).getBytes());
         }
         if (element.get("BASM_segment_id") != null) {
             obj.addColumn(SEGMENT_FAMILY, SEGMENT_ID, ((String) element.get("BASM_segment_id")).getBytes());
         }
         if (element.get("BAS_sub_category") != null) {
             obj.addColumn(SEGMENT_FAMILY, SUB_CATEGORY, ((String) element.get("BAS_sub_category")).getBytes());
         }
         if (element.get("BAS_name") != null) {
             obj.addColumn(SEGMENT_FAMILY, NAME, ((String) element.get("BAS_name")).getBytes());
         }
         if (element.get("BAS_description") != null) {
             obj.addColumn(SEGMENT_FAMILY, DESCRIPTION, ((String) element.get("BAS_description")).getBytes());
         }
         if (element.get("BASM_krux_user_id") != null) {
             obj.addColumn(USER_FAMILY, KRUX_USER_ID, ((String) element.get("BASM_krux_user_id")).getBytes());
         }
         if (element.get("BAS_last_compute_day") != null) {
             obj.addColumn(SEGMENT_FAMILY, LAST_COMPUTE_DAY, ((String) element.get("BAS_last_compute_day")).getBytes());
         }
         if (element.get("BAS_type") != null) {
             obj.addColumn(SEGMENT_FAMILY, TYPE, ((String) element.get("BAS_type")).getBytes());
         }
         if (element.get("BASM_REGID") != null) {
             obj.addColumn(USER_FAMILY, REGID, ((String) element.get("BASM_REGID")).getBytes());
         }
         return obj;
     } else {
         return null;
     }
}
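The repeated null-check/cast/addColumn pattern above can be factored into a small helper, which keeps each field to one line and makes it harder to forget a guard. A minimal, self-contained sketch of the idea, using a plain Map and List as hypothetical stand-ins for TableRow and Put:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NullSafeColumns {
    // Collected (column, value) pairs; stands in for the Put.addColumn calls.
    static List<String> added = new ArrayList<>();

    // Add the column only when the field is present, mirroring the repeated
    // `if (element.get(key) != null)` blocks in convertDataToRow.
    static void addIfPresent(Map<String, Object> row, String key, String column) {
        Object value = row.get(key);
        if (value != null) {
            byte[] bytes = ((String) value).getBytes(StandardCharsets.UTF_8);
            added.add(column + "=" + new String(bytes, StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("BAS_category", "retail");
        // "BAS_name" is intentionally absent: addIfPresent skips it.
        addIfPresent(row, "BAS_category", "category");
        addIfPresent(row, "BAS_name", "name");
        System.out.println(added);
    }
}
```

In the real DoFn the helper would take the Put, the column family, and the qualifier instead of appending to a list, but the null-guarding logic is the same.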

We have 30 Bigtable nodes and my Dataflow job runs with 100 workers. The whole process has to handle around 10 billion rows of data; with the above configuration the job takes more than a day to complete, which is not ideal.

Are there any code-level changes that would make the job run faster? I know increasing the number of Bigtable nodes is one option, but for now I am looking for alternatives that don't require adding nodes.

Answer

You might wish to look at this question. Based on that, check your cluster's write bandwidth: if you are above 80%, you may want to either reduce the number of worker nodes, or ask for a quota increase and grow the size of your cluster.
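If the write-bandwidth check suggests the 100 workers are overwhelming the 30-node cluster, one simple knob is the Dataflow worker count, which can be pinned when launching the job. A sketch of the launch command for the SDK generation used above (the main class and project ID are placeholders; `--numWorkers` and `--maxNumWorkers` are standard Dataflow pipeline options):

```shell
# Launch the job with a fixed worker pool instead of autoscaling up to 100.
mvn compile exec:java \
  -Dexec.mainClass=YOUR_MAIN_CLASS \
  -Dexec.args="--project=my-project \
               --runner=DataflowPipelineRunner \
               --numWorkers=30 \
               --maxNumWorkers=30"
```

Matching the worker count roughly to the Bigtable node count is a common starting point, since each node can only absorb a limited write rate regardless of how many clients are pushing.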
