Dataproc + BigQuery examples - any available?


Question

According to the Dataproc docos, it has "native and automatic integrations with BigQuery".

I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job). Then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is because we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.

Are there any Dataproc + BigQuery examples available? I can't find any.

Answer

To begin, as noted in this question, the BigQuery connector is preinstalled on Cloud Dataproc clusters.

Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery to perform a word count. You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD; the Spark documentation has more information about using SparkContext.newAPIHadoopRDD.

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject

import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
    "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// sc is the SparkContext provided by spark-shell.
val conf = sc.hadoopConfiguration

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
    fullyQualifiedOutputTableId, outputTableSchema)

// Name of the column we care about in each JSON record.
val fieldName = "word"

// Each element is a (row number, JsonObject) pair.
val tableData = sc.newAPIHadoopRDD(conf,
    classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
// Inspect the first ten rows as (row number, JSON) string pairs.
tableData.map(entry => (entry._1.toString, entry._2.toString)).take(10)
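The snippet above stops after reading the rows; the word count itself is just a reduction over the parsed JSON records. Since the question targets PySpark, here is a hedged plain-Python sketch of that aggregation step, run over dicts shaped like rows of publicdata:samples.shakespeare (the word and word_count fields come from that sample table's schema; the sample values are illustrative, not real query results):

```python
import json
from collections import defaultdict

# Each value the connector yields is a JSON string for one table row;
# these sample rows mimic the shape of publicdata:samples.shakespeare.
rows = [
    '{"word": "the", "word_count": 25568, "corpus": "hamlet"}',
    '{"word": "the", "word_count": 24300, "corpus": "macbeth"}',
    '{"word": "brave", "word_count": 6, "corpus": "macbeth"}',
]

# Sum word_count per word across corpora - the same reduction a
# PySpark job would express as a map followed by reduceByKey.
totals = defaultdict(int)
for raw in rows:
    record = json.loads(raw)
    totals[record["word"]] += record["word_count"]

print(sorted(totals.items()))  # → [('brave', 6), ('the', 49868)]
```

In a real PySpark job the same logic would run distributed over the RDD returned by newAPIHadoopRDD rather than over a local list.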

You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
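A fully qualified BigQuery table ID takes the form project:dataset.table (the shakespeare input above follows this pattern). As a small illustration, here is a hypothetical Python helper for splitting such an ID into its parts when templating these config values — the function name and approach are my own, not part of the connector:

```python
def split_table_id(table_id: str):
    """Split a fully qualified 'project:dataset.table' ID into its parts."""
    project, rest = table_id.split(":")
    dataset, table = rest.split(".")
    return project, dataset, table

# The public sample table used as input above:
print(split_table_id("publicdata:samples.shakespeare"))
# → ('publicdata', 'samples', 'shakespeare')
```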

Finally, if you end up using the BigQuery connector with MapReduce, this page has examples of how to write MapReduce jobs with the BigQuery connector.
