BigQuery to Hadoop Cluster - How to transfer data?


Problem Description

I have a Google Analytics (GA) account which tracks the user activity of an app. I have BigQuery set up so that I can access the raw GA data, and data is coming in from GA to BigQuery on a daily basis.

I have a Python app which queries the BigQuery API programmatically. This app gives me the required response, depending on what I am querying for.



My next step is to get this data from BigQuery and dump it into a Hadoop cluster. Ideally, I would like to create a Hive table from the data. I want to build something like an ETL process around the Python app: for example, every day the ETL process runs the Python app and exports the resulting data to the cluster.

Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.

What architecture/design/general factors would I need to consider while planning this ETL process?



Any suggestions on how I should go about this? I am interested in doing this in the simplest and most viable way.



Thanks in advance.

Solution

The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:

https://cloud.google.com/hadoop/bigquery-connector



This connector defines a BigQueryInputFormat class (see the job sketch after the list below), which:


  • Uses a query to select the appropriate BigQuery objects.

  • Splits the results of the query evenly among the Hadoop nodes.

  • Parses the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.


    (It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
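To make that flow concrete, here is a minimal, map-only sketch of a job built on the connector. It is only a sketch: it assumes the connector jar is on the job's classpath and that the cluster can reach Google Cloud Storage; the project ID, bucket, table ID, GA field names, and output path are placeholders; and the class and configuration-key names (GsonBigQueryInputFormat, BigQueryConfiguration.PROJECT_ID_KEY, and so on) follow the connector's documentation, in which the Gson-based variant of the BigQueryInputFormat described above is the one normally used.

```java
import java.io.IOException;

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
import com.google.gson.JsonObject;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class BigQueryToHdfs {

  // Each map() call receives one BigQuery row as a gson JsonObject.
  public static class SessionMapper
      extends Mapper<LongWritable, JsonObject, Text, Text> {
    @Override
    protected void map(LongWritable key, JsonObject row, Context context)
        throws IOException, InterruptedException {
      // "fullVisitorId" / "visitStartTime" are placeholder GA export fields;
      // adjust to whatever columns your table actually contains.
      String visitorId = row.get("fullVisitorId").getAsString();
      String visitStart = row.get("visitStartTime").getAsString();
      context.write(new Text(visitorId), new Text(visitStart));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Placeholder project / bucket IDs - replace with your own.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project-id");
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket");

    // Tell the connector which BigQuery table to read; it stages the table's
    // data in Google Cloud Storage and splits it among the mappers.
    BigQueryConfiguration.configureBigQueryInput(
        conf, "my-project-id:ga_export.ga_sessions_20150101");

    Job job = Job.getInstance(conf, "bigquery-to-hdfs");
    job.setJarByClass(BigQueryToHdfs.class);
    job.setMapperClass(SessionMapper.class);
    job.setNumReduceTasks(0);  // map-only export

    job.setInputFormatClass(GsonBigQueryInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/ga_sessions"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

As written, the job drops tab-separated text into HDFS, which a Hive external table could be pointed at; swapping the output format (or adding a reduce step) would let the same job write a more Hive-friendly format such as ORC or Parquet. Because the connector stages the table as a temporary export in the configured GCS bucket, its documentation also recommends cleaning up that export once the job completes.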

