Why is my PCollection (SCollection) size so large compared to the BigQuery table input size?


Problem description

The above image is the table schema for a BigQuery table that is the input to an Apache Beam Dataflow job built on Spotify's scio. If you aren't familiar with scio, it's a Scala wrapper around the Apache Beam Java SDK; in particular, an SCollection wraps a PCollection. My input table on BigQuery disk is 136 GB, but when I look at the size of my SCollection in the Dataflow UI it is 504.91 GB.

I understand that BigQuery is likely much better at data compression and representation, but a >3x increase in size seems quite high. To be very clear, I'm using the type-safe BigQuery case class representation (let's call it Clazz), so my SCollection is of type SCollection[Clazz] rather than SCollection[TableRow], where TableRow is the native representation in the Beam Java SDK. Any tips on how to keep the memory allocation down? Is it related to particular column types in my input: bytes, strings, records, floats, etc.?
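
For reference, a minimal sketch of what such a type-safe read looks like in scio, assuming a recent scio version; the table spec and the generated class name Clazz are placeholders rather than the actual table from the question:

import com.spotify.scio._
import com.spotify.scio.bigquery._
import com.spotify.scio.bigquery.types.BigQueryType
import com.spotify.scio.values.SCollection

object TypedBigQueryReadExample {

  // Macro-generated case class whose fields mirror the table schema.
  // The table spec below is a placeholder.
  @BigQueryType.fromTable("my-project:my_dataset.my_table")
  class Clazz

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Type-safe read: rows are decoded directly into Clazz instead of TableRow.
    val rows: SCollection[Clazz] = sc.typedBigQuery[Clazz]()

    // ... transforms on rows ...

    sc.run()
  }
}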

Recommended answer

This is likely due to the TableRow format, which stores the string name of every column alongside its value and so adds to the size.

Consider using the following overload of BigQueryIO.read to create a PCollection of objects instead of TableRows. It lets you read each row directly into an object that matches the schema, which should reduce the data size somewhat.

  /**
   * Reads from a BigQuery table or query and returns a {@link PCollection} with one element per
   * each row of the table or query result, parsed from the BigQuery AVRO format using the specified
   * function.
   *
   * <p>Each {@link SchemaAndRecord} contains a BigQuery {@link TableSchema} and a
   * {@link GenericRecord} representing the row, indexed by column name. Here is a
   * sample parse function that parses click events from a table.
   *
   * <pre>{@code
   * class ClickEvent { long userId; String url; ... }
   *
   * p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, ClickEvent>() {
   *   public ClickEvent apply(SchemaAndRecord record) {
   *     GenericRecord r = record.getRecord();
   *     return new ClickEvent((Long) r.get("userId"), (String) r.get("url"));
   *   }
   * }).from("..."));
   * }</pre>
   */
  public static <T> TypedRead<T> read(
      SerializableFunction<SchemaAndRecord, T> parseFn) {
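
If you want to stay in scio rather than drop down to raw Beam, here is a rough sketch along the same lines. It assumes scio's customInput hook for wrapping a Beam source transform, and uses a hypothetical narrower ClickEvent case class whose field names come from the Javadoc sample above, not from the actual table:

import com.spotify.scio._
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, SchemaAndRecord}
import org.apache.beam.sdk.transforms.SerializableFunction

// Hypothetical case class holding only the fields the job actually needs.
case class ClickEvent(userId: Long, url: String)

object ParseFnReadExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Parse each Avro GenericRecord straight into ClickEvent,
    // bypassing the TableRow representation entirely.
    val parseFn = new SerializableFunction[SchemaAndRecord, ClickEvent] {
      override def apply(input: SchemaAndRecord): ClickEvent = {
        val r = input.getRecord
        ClickEvent(r.get("userId").asInstanceOf[Long], r.get("url").toString)
      }
    }

    // Wrap the Beam transform so the result comes back as an SCollection.
    val events: SCollection[ClickEvent] = sc.customInput(
      "ReadClickEvents",
      BigQueryIO.read(parseFn).from("my-project:my_dataset.my_table") // placeholder table spec
    )

    sc.run()
  }
}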
