Partitioning a table


Problem Description





BigQuery allows partitioning only by date at this time.

Let's suppose I have a table with 1 billion rows and an inserted_timestamp field, and that this field holds dates going back one year.

What is the right way to move existing data to a new partitioned table?

Edited

I saw there was an elegant Java solution for versions < 2.0 in "Sharding BigQuery output tables", also elaborated in "BigQuery partitioning with Beam streams": parametrize the table name (or partition suffix) from the windowed data.

But I can't find that BigQueryIO.Write capability in the 2.x Beam project, and there are no samples showing how to get the window time from a Python serializable function.

I tried to create the partitions in the pipeline, but it fails with a large number of partitions (it runs with 100 but fails with 1,000).

This is my code, as far as I could get:

(p
 | 'lectura' >> beam.io.ReadFromText(input_table)
 | 'noheaders' >> beam.Filter(lambda s: s[0].isdigit())
 | 'addtimestamp' >> beam.ParDo(AddTimestampDoFn())
 | 'window' >> beam.WindowInto(beam.window.FixedWindows(60))
 | 'table2row' >> beam.Map(to_table_row)
 | 'write2table' >> beam.io.Write(beam.io.BigQuerySink(
       output_table,   # <-- unable to parametrize by window
       dataset=my_dataset,
       project=project,
       schema='dia:DATE, classe:STRING, cp:STRING, import:FLOAT',
       create_disposition=CREATE_IF_NEEDED,
       write_disposition=WRITE_TRUNCATE,
   ))
)

p.run()

Solution

All of the functionality necessary to do this exists in Beam, although it may currently be limited to the Java SDK.

You would use BigQueryIO. Specifically, you may use DynamicDestinations to determine a destination table for each row.

From the DynamicDestinations example:

events.apply(BigQueryIO.<UserEvent>write()
  .to(new DynamicDestinations<UserEvent, String>() {
        // Choose a destination key (here, the user id) for each element.
        public String getDestination(ValueInSingleWindow<UserEvent> element) {
          return element.getValue().getUserId();
        }
        // Map that key to a concrete BigQuery table.
        public TableDestination getTable(String user) {
          return new TableDestination(tableForUser(user),
            "Table for user " + user);
        }
        // Supply the schema for that table.
        public TableSchema getSchema(String user) {
          return tableSchemaForUser(user);
        }
      })
  .withFormatFunction(new SerializableFunction<UserEvent, TableRow>() {
     public TableRow apply(UserEvent event) {
       return convertUserEventToTableRow(event);
     }
   }));
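
To connect this to the original question (one day partition per inserted_timestamp date), here is a minimal sketch of how the same hook might target partition decorators instead of per-user tables. It is my adaptation, not part of the quoted answer: rows is assumed to be a PCollection<TableRow> whose dia field holds a yyyy-MM-dd string, the table spec my-project:my_dataset.my_table is a placeholder, and the target table is assumed to already exist as a day-partitioned table (a decorator such as my_table$20170115 then addresses a single day).

// Hypothetical adaptation: route each TableRow to the day partition named by
// its "dia" field. Table spec and field names are assumptions, not from the answer.
rows.apply(BigQueryIO.<TableRow>write()
    .to(new DynamicDestinations<TableRow, String>() {
          public String getDestination(ValueInSingleWindow<TableRow> element) {
            // "2017-01-15" -> "20170115", the partition decorator suffix.
            return ((String) element.getValue().get("dia")).replace("-", "");
          }
          public TableDestination getTable(String day) {
            // "$day" selects a single partition of an existing day-partitioned table.
            return new TableDestination(
                "my-project:my_dataset.my_table$" + day,
                "Partition " + day);
          }
          public TableSchema getSchema(String day) {
            return new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("dia").setType("DATE"),
                new TableFieldSchema().setName("classe").setType("STRING"),
                new TableFieldSchema().setName("cp").setType("STRING"),
                new TableFieldSchema().setName("import").setType("FLOAT")));
          }
        })
    .withFormatFunction(new SerializableFunction<TableRow, TableRow>() {
        public TableRow apply(TableRow row) {
          return row;  // elements are already TableRows
        }
      })
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

The design point that matters here is that getDestination runs per element, so the partition is derived from the data (or its window) at runtime rather than being fixed at pipeline construction time, which is exactly what the BigQuerySink call in the Python snippet above cannot express.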

