Best strategy for joining two large datasets


Problem description

I'm currently trying to find the best way of processing two very large datasets.

I have two BigQuery tables:

  • A table containing streaming events (a billion rows)
  • A table containing tags and the associated event properties (100,000 rows)

I want to tag each event with the appropriate tags based on the event's properties (an event can have multiple tags). However, a SQL cross join seems to be too slow for datasets of this size.

What is the best way to proceed using a pipeline of MapReduce jobs while avoiding a very costly shuffle phase, given that each event has to be compared to each tag?
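For scale, the naive cross join compares every event row against every tag row; a quick back-of-the-envelope in Python, using the row counts from the question:

```python
events = 1_000_000_000  # ~a billion event rows
tags = 100_000          # rows in the tags table

# A cross join performs one comparison per (event, tag) pair.
comparisons = events * tags
print(f"{comparisons:.1e} comparisons")  # 1.0e+14
```

A hundred trillion comparisons is why broadcasting the small table to the workers, rather than shuffling the large one, is the usual remedy.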

Also, I'm planning to use Google Cloud Dataflow. Is this tool suited to the task?

Recommended answer

Google Cloud Dataflow is a good fit for this.

Assuming the tags data is small enough to fit in memory, you can avoid a shuffle by passing it as a side input.

Your pipeline would look like this:

  • Use two BigQueryIO transforms to read from each table.
  • Create a DoFn that tags each event with its tags.
  • The input PCollection to your DoFn should be the events. Pass the table of tags as a side input.
  • Use a BigQueryIO transform to write the result back to BigQuery (assuming you want BigQuery as the output).
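The steps above can be sketched in plain Python, with in-memory lists standing in for the two BigQuery tables; the table contents, property names, and tagging rule are illustrative assumptions:

```python
# Broadcast-join sketch: the small tags table is indexed in memory and
# each event is tagged by lookup, so no shuffle over the events is needed.

def build_tag_index(tag_rows):
    """Index tag rows by the event property they match on."""
    index = {}
    for row in tag_rows:
        index.setdefault(row["property"], []).append(row["tag"])
    return index

def tag_event(event, tag_index):
    """Equivalent of the DoFn body; tag_index plays the side-input role."""
    tags = []
    for prop in event["properties"]:
        tags.extend(tag_index.get(prop, []))
    return {**event, "tags": tags}

# Illustrative stand-ins for the two BigQuery tables.
tag_rows = [
    {"property": "checkout", "tag": "commerce"},
    {"property": "login", "tag": "auth"},
    {"property": "checkout", "tag": "funnel"},
]
events = [
    {"id": 1, "properties": ["login"]},
    {"id": 2, "properties": ["checkout", "login"]},
]

tag_index = build_tag_index(tag_rows)
tagged = [tag_event(e, tag_index) for e in events]
```

In an actual Dataflow pipeline, the tag index would be derived from the PCollection read by the second BigQueryIO and handed to the DoFn as a side input (e.g. via `beam.pvalue.AsDict` in the Python SDK), while `tag_event` would become the DoFn's `process` body.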

If your tags data is too large to fit in memory, you will most likely have to use a join.
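A shuffle-based join groups both datasets by a shared key (in Beam this is what `CoGroupByKey` does). A minimal in-memory sketch of that grouping, with illustrative keys and values:

```python
from collections import defaultdict

# Keyed-join sketch: both datasets are keyed by the shared event property
# and grouped together, which is the work the shuffle phase performs.

def cogroup(events_by_prop, tags_by_prop):
    """Return {property: ([event ids], [tags])} for the two keyed datasets."""
    grouped = defaultdict(lambda: ([], []))
    for prop, event_id in events_by_prop:
        grouped[prop][0].append(event_id)
    for prop, tag in tags_by_prop:
        grouped[prop][1].append(tag)
    return grouped

events_by_prop = [("checkout", 1), ("login", 1), ("checkout", 2)]
tags_by_prop = [("checkout", "commerce"), ("login", "auth")]

joined = cogroup(events_by_prop, tags_by_prop)
# joined["checkout"] -> ([1, 2], ["commerce"])
```

Each event id can then be emitted once per co-grouped tag; the cost is that the billion-row events collection must be shuffled by key, which is exactly what the side-input approach avoids when the tags fit in memory.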
