Scan a Hadoop Database table in Spark using indices from an RDD


Question

So, suppose there is a table in the database as shown below:


Key2  Date  Time  Age
AAA1  XXX   XXX   XXX
AAA2  XXX   XXX   XXX
AAA3  XXX   XXX   XXX
AAA4  XXX   XXX   XXX
AAA5  XXX   XXX   XXX
AAA6  XXX   XXX   XXX
AAA7  XXX   XXX   XXX
AAA8  XXX   XXX   XXX
BBB1  XXX   XXX   XXX
BBB2  XXX   XXX   XXX
BBB3  XXX   XXX   XXX
BBB4  XXX   XXX   XXX
BBB5  XXX   XXX   XXX
CCC1  XXX   XXX   XXX
CCC2  XXX   XXX   XXX
CCC3  XXX   XXX   XXX
CCC4  XXX   XXX   XXX
CCC5  XXX   XXX   XXX
CCC6  XXX   XXX   XXX
CCC7  XXX   XXX   XXX
DDD1  XXX   XXX   XXX
DDD2  XXX   XXX   XXX
DDD3  XXX   XXX   XXX
DDD4  XXX   XXX   XXX
DDD5  XXX   XXX   XXX
DDD6  XXX   XXX   XXX
DDD7  XXX   XXX   XXX

I have a 2nd table, given as

1  AAA
2  DDD
3  CCC

Since AAA, DDD and CCC are in table2, I want to scan all rows corresponding to these 3 values from table1, i.e. I want to get AAA1-AAA8, DDD1-DDD7 and CCC1-CCC7. Table1 is a Hadoop database table with millions of rows, and I only want to scan it for the rows whose keys match the values I get from table2. Could anyone help me with an efficient way of doing this? I have table2 as an RDD, and table1 is in HBase.
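In plain Python terms, the goal is a prefix filter like the following toy sketch (in-memory data standing in for the HBase table; the hard part of the question is doing this efficiently server-side instead of scanning every row):

```python
# Toy stand-in for table1: row keys mapped to values (contents don't matter here).
table1 = {"%s%d" % (p, i): "XXX"
          for p in ("AAA", "BBB", "CCC", "DDD") for i in range(1, 6)}

# table2 holds the interesting key prefixes (in the real setup, an RDD).
table2 = ["AAA", "DDD", "CCC"]

# Keep only the rows whose key starts with one of the prefixes.
prefixes = tuple(table2)
wanted = {k: v for k, v in table1.items() if k.startswith(prefixes)}
```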

Solution

The difficult part is actually setting up the HBase connector, either the one from Hortonworks or the one from Huawei.

But anyway I think you are asking about the query itself, so I have quickly built a toy example using Hive (i.e. creating the HBase table using the shell and then adding a create external table in Hive).
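For reference, that toy setup looks roughly like this (the column family name `cf` and the sample values are assumptions, not taken from the original):

```sql
-- HBase shell (hypothetical column family "cf"):
--   create 'hbase_table_1', 'cf'
--   put 'hbase_table_1', 'AAA1', 'cf:column_1', 'abcd'

-- Hive external table mapped onto the HBase table, exposing the
-- row key as "key" and cf:column_1 as "column_1":
CREATE EXTERNAL TABLE hbase_table_1 (key string, column_1 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:column_1")
TBLPROPERTIES ("hbase.table.name" = "hbase_table_1");
```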

Then I create a SQL context using the Hive context.

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

The full toy table has 3 rows:

df = sqlContext.sql("select * from hbase_table_1")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
|BBB1|    jklm|
+----+--------+

and to access a subset of the HBase rowkeys:

df = sqlContext.sql("select * from hbase_table_1 where key >= 'AAA' and key < 'BBB'")
df.show(3)
+----+--------+
| key|column_1|
+----+--------+
|AAA1|    abcd|
|AAA2|    efgh|
+----+--------+
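The range predicate above covers a single prefix. To cover every prefix coming from table2 (the RDD), one option, sketched here with hypothetical names, is to collect the prefixes to the driver and build a disjunction of half-open key ranges, bumping the last character of each prefix to get its upper bound:

```python
def upper_bound(prefix):
    """Smallest string greater than every key starting with `prefix`:
    bump the last character by one ('AAA' -> 'AAB')."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def prefix_query(prefixes, table="hbase_table_1"):
    # One half-open range per prefix, OR-ed together so a single SQL
    # statement covers all the requested key ranges.
    ranges = ["(key >= '%s' and key < '%s')" % (p, upper_bound(p))
              for p in sorted(prefixes)]
    return "select * from %s where %s" % (table, " or ".join(ranges))

# In the real job the prefixes would come from table2, e.g.:
#   prefixes = table2_rdd.map(lambda row: row[1]).collect()
sql = prefix_query(["AAA", "DDD", "CCC"])
df = None  # then: df = sqlContext.sql(sql)
```

Whether the OR-ed ranges are actually pushed down as separate scans (rather than a full scan plus a filter) depends on the connector, so it is worth checking the resulting plan with `df.explain()`.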

For performance you should definitely go for one of the HBase connectors, but once you have it (at least for Hortonworks') the query should be the same.
