Dataproc + BigQuery examples - any available?


Question

According to the Dataproc docos, it has "native and automatic integrations with BigQuery".

I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is that we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.

Are there any Dataproc + BigQuery examples available? I can't find any.

Solution

To begin, as noted in this question, the BigQuery connector comes preinstalled on Cloud Dataproc clusters.

Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery to perform a word count. You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD; the Spark documentation has more information about using it.

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject

import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
    "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

val conf = sc.hadoopConfiguration

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
    fullyQualifiedOutputTableId, outputTableSchema)

// The JSON field of each input row used for the word count (see below).
val fieldName = "word"

// Load the table as an RDD of (row offset, JSON record) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
    classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
// Sample the first ten rows as strings.
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
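
The snippet above stops after sampling raw rows; the word count itself is not shown. A minimal sketch of that step, assuming each row's JsonObject carries the word field of the shakespeare table (the wordCounts name and the aggregation are illustrative, not from the original answer):

// Hypothetical continuation: pull the "word" field out of each row's
// JsonObject and count occurrences with reduceByKey.
val wordCounts = tableData
    .map(entry => (entry._2.get(fieldName).getAsString.toLowerCase, 1L))
    .reduceByKey(_ + _)
wordCounts.take(10).foreach(println)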

You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
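
The question also asks about writing results back to BigQuery. The conf above already carries the output table and schema via configureBigQueryOutput, so what remains is converting each result into a JsonObject matching outputTableSchema and saving through the connector's output format. A minimal sketch, assuming a connector version that ships BigQueryOutputFormat (verify the class against the connector release on your cluster):

import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat

// Tell Hadoop which OutputFormat saveAsNewAPIHadoopDataset should use.
conf.set("mapreduce.job.outputformat.class",
    classOf[BigQueryOutputFormat[_, _]].getName)

// Build one JsonObject per output row; the output format ignores the
// key, so null is passed in its place.
val outputRDD = wordCounts.map { case (word, count) =>
  val json = new JsonObject()
  json.addProperty("Word", word)
  json.addProperty("Count", count)
  (null, json)
}
outputRDD.saveAsNewAPIHadoopDataset(conf)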

Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
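
For orientation, the MapReduce side uses the same connector classes as the Spark snippet above. A rough, hypothetical sketch of the job wiring (mapper and reducer omitted):

import org.apache.hadoop.mapreduce.Job

// The BigQueryConfiguration calls shown earlier populate conf; the
// connector classes then plug in as the job's input and output formats.
val job = Job.getInstance(conf, jobName)
job.setInputFormatClass(classOf[GsonBigQueryInputFormat])
job.setOutputFormatClass(classOf[BigQueryOutputFormat[_, _]])
// job.setMapperClass(...) / job.setReducerClass(...) as usual.
job.waitForCompletion(true)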
