Scan a Hadoop Database table in Spark using indices from an RDD


Question

So, suppose there is a table in the database as shown below:


Key2  Date  Time  Age
AAA1  XXX   XXX   XXX
AAA2  XXX   XXX   XXX
AAA3  XXX   XXX   XXX
AAA4  XXX   XXX   XXX
AAA5  XXX   XXX   XXX
AAA6  XXX   XXX   XXX
AAA7  XXX   XXX   XXX
AAA8  XXX   XXX   XXX
BBB1  XXX   XXX   XXX
BBB2  XXX   XXX   XXX
BBB3  XXX   XXX   XXX
BBB4  XXX   XXX   XXX
BBB5  XXX   XXX   XXX
CCC1  XXX   XXX   XXX
CCC2  XXX   XXX   XXX
CCC3  XXX   XXX   XXX
CCC4  XXX   XXX   XXX
CCC5  XXX   XXX   XXX
CCC6  XXX   XXX   XXX
CCC7  XXX   XXX   XXX
DDD1  XXX   XXX   XXX
DDD2  XXX   XXX   XXX
DDD3  XXX   XXX   XXX
DDD4  XXX   XXX   XXX
DDD5  XXX   XXX   XXX
DDD6  XXX   XXX   XXX
DDD7  XXX   XXX   XXX

I have a 2nd table, given as

1  AAA
2  DDD
3  CCC

Since AAA, DDD and CCC are in table2, I want to scan all rows corresponding to these 3 values from table1, i.e. I want to get AAA1-AAA8, DDD1-DDD7 and CCC1-CCC7. Table1 is a Hadoop database table with millions of rows, and I only want to scan it for the rows whose keys match the values I get from table2. Could anyone help me with an efficient way of doing this? I have table2 as an RDD, and table1 is in HBase.
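In plain Python terms, the goal is a prefix filter like the following toy sketch (in-memory data standing in for the HBase table; the hard part of the question is doing this efficiently server-side instead of scanning every row):

```python
# Toy stand-in for table1: row keys mapped to values (contents don't matter here).
table1 = {"%s%d" % (p, i): "XXX"
          for p in ("AAA", "BBB", "CCC", "DDD") for i in range(1, 6)}

# table2 holds the interesting key prefixes (in the real setup, an RDD).
table2 = ["AAA", "DDD", "CCC"]

# Keep only the rows whose key starts with one of the prefixes.
prefixes = tuple(table2)
wanted = {k: v for k, v in table1.items() if k.startswith(prefixes)}
```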

Solution

The difficult part is actually setting up the HBase connector, either the one from Hortonworks or the one from Huawei.

But anyway I think you are asking about the query itself, so I have quickly built a toy example using Hive (i.e. creating the HBase table using the shell and then adding a create external table in Hive).
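For reference, that toy setup looks roughly like this (the column family name `cf` and the sample values are assumptions, not taken from the original):

```sql
-- HBase shell (hypothetical column family "cf"):
--   create 'hbase_table_1', 'cf'
--   put 'hbase_table_1', 'AAA1', 'cf:column_1', 'abcd'

-- Hive external table mapped onto the HBase table, exposing the
-- row key as "key" and cf:column_1 as "column_1":
CREATE EXTERNAL TABLE hbase_table_1 (key string, column_1 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:column_1")
TBLPROPERTIES ("hbase.table.name" = "hbase_table_1");
```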

Then I create a SQL context using the Hive context.

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

The full toy table has 3 rows:

df = sqlContext.sql("select * from hbase_table_1")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
|BBB1|    jklm|
+----+--------+

and to access a subset of the HBase rowkeys:

df = sqlContext.sql("select * from hbase_table_1 where key >= 'AAA' and key < 'BBB'")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
+----+--------+
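The range predicate above covers a single prefix. To cover every prefix coming from table2 (the RDD), one option, sketched here with hypothetical names, is to collect the prefixes to the driver and build a disjunction of half-open key ranges, bumping the last character of each prefix to get its upper bound:

```python
def upper_bound(prefix):
    """Smallest string greater than every key starting with `prefix`:
    bump the last character by one ('AAA' -> 'AAB')."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def prefix_query(prefixes, table="hbase_table_1"):
    # One half-open range per prefix, OR-ed together so a single SQL
    # statement covers all the requested key ranges.
    ranges = ["(key >= '%s' and key < '%s')" % (p, upper_bound(p))
              for p in sorted(prefixes)]
    return "select * from %s where %s" % (table, " or ".join(ranges))

# In the real job the prefixes would come from table2, e.g.:
#   prefixes = table2_rdd.map(lambda row: row[1]).collect()
sql = prefix_query(["AAA", "DDD", "CCC"])
df = None  # then: df = sqlContext.sql(sql)
```

Whether the OR-ed ranges are actually pushed down as separate scans (rather than a full scan plus a filter) depends on the connector, so it is worth checking the resulting plan with `df.explain()`.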

For performance you should definitely go for one of the HBase connectors, but once you have it (at least for Hortonworks') the query should be the same.
