Why is my PCollection (SCollection) size so large compared to BigQuery Table input size?


Question

The image above shows the table schema for a BigQuery table that is the input to an Apache Beam Dataflow job running on Spotify's Scio. If you aren't familiar with Scio, it is a Scala wrapper around the Apache Beam Java SDK; in particular, an SCollection wraps a PCollection. My input table is 136 GB on BigQuery disk, but looking at the size of my SCollection in the Dataflow UI, it is 504.91 GB.

I understand that BigQuery is likely much better at data compression and representation, but a >3x increase in size seems quite high. To be very clear, I'm using the type-safe BigQuery case class representation (let's call it Clazz), so my SCollection is of type SCollection[Clazz] instead of SCollection[TableRow]. TableRow is the native representation in the Beam Java SDK. Any tips on how to keep the memory allocation down? Is it related to a particular column type in my input: Bytes, Strings, Records, Floats, etc.?
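For reference, the typed read described above looks roughly like this in Scio. This is a minimal sketch, not my exact job; the table name is hypothetical, and the @BigQueryType.fromTable annotation generates a case class whose fields mirror the table schema:

import com.spotify.scio.ContextAndArgs
import com.spotify.scio.bigquery._

object TypedReadJob {
  // The annotation generates a case class matching the table schema.
  @BigQueryType.fromTable("my-project:my_dataset.my_table") // hypothetical table
  class Clazz

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)
    val rows = sc.typedBigQuery[Clazz]() // SCollection[Clazz]
    // ... transforms ...
    sc.run()
  }
}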

Answer

This is likely due to the TableRow format, which stores the string names of the columns with every row and therefore adds to the size.
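To illustrate the point (a minimal sketch; the field names and values are hypothetical): a TableRow is essentially a map keyed by column name, so the key strings are serialized with every element, whereas a case class or Avro record carries only the values and keeps the field names in a schema that exists once.

import com.google.api.services.bigquery.model.TableRow

// With TableRow, the column-name strings "userId" and "url" travel
// with every serialized element.
val asTableRow = new TableRow().set("userId", 123L).set("url", "https://example.com")

// With a case class (or an Avro GenericRecord), each element carries only
// the values; the field names live once, in the class/schema definition.
case class ClickEvent(userId: Long, url: String)
val asCaseClass = ClickEvent(123L, "https://example.com")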

Consider using the following to create a PCollection of objects instead of TableRows. This lets you read directly into an object that matches the schema, which should reduce the data size somewhat.

  /**
   * Reads from a BigQuery table or query and returns a {@link PCollection} with one element per
   * each row of the table or query result, parsed from the BigQuery AVRO format using the specified
   * function.
   *
   * <p>Each {@link SchemaAndRecord} contains a BigQuery {@link TableSchema} and a
   * {@link GenericRecord} representing the row, indexed by column name. Here is a
   * sample parse function that parses click events from a table.
   *
   * <pre>{@code
   * class ClickEvent { long userId; String url; ... }
   *
   * p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, ClickEvent>() {
   *   public ClickEvent apply(SchemaAndRecord record) {
   *     GenericRecord r = record.getRecord();
   *     return new ClickEvent((Long) r.get("userId"), (String) r.get("url"));
   *   }
   * }).from("..."));
   * }</pre>
   */
  public static <T> TypedRead<T> read(
      SerializableFunction<SchemaAndRecord, T> parseFn) {
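In a Scio pipeline, this parse-function read can be wired in through ScioContext#customInput. Below is a minimal sketch under that assumption; the table name, field names, and the ClickEvent type are hypothetical:

import com.spotify.scio.ContextAndArgs
import org.apache.beam.sdk.coders.SerializableCoder
import org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, SchemaAndRecord}
import org.apache.beam.sdk.transforms.SerializableFunction

// Plain case class matching the table schema; case classes are
// Serializable, so SerializableCoder works here.
case class ClickEvent(userId: Long, url: String)

object ParseFnRead {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    val read = BigQueryIO
      .read(new SerializableFunction[SchemaAndRecord, ClickEvent] {
        override def apply(record: SchemaAndRecord): ClickEvent = {
          val r = record.getRecord // Avro GenericRecord keyed by column name
          ClickEvent(
            r.get("userId").asInstanceOf[Long],
            r.get("url").toString // Avro strings are Utf8, so convert explicitly
          )
        }
      })
      .from("my-project:my_dataset.clicks") // hypothetical table
      .withCoder(SerializableCoder.of(classOf[ClickEvent]))

    val events = sc.customInput("ReadClicks", read)
    // ... transforms on events ...
    sc.run()
  }
}

SerializableCoder is just the simplest coder that compiles here; a more compact coder for the case class would shrink the serialized elements further.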

