PHOENIX SPARK - Load Table as DataFrame

This article covers PHOENIX SPARK - Load Table as DataFrame; hopefully the answer below is a useful reference for anyone facing the same problem.

Problem Description

I have created a DataFrame from an HBase table (Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans and used it to join with data from a file.

// Read the Phoenix table into a DataFrame via the phoenix-spark connector.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();

// Convert each Row into an (ID, NAME) pair for the join with the file data.
JavaRDD<Row> tableRows = df.toJavaRDD();
JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>()
    {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception
        {
            return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
        }
    });

Now my question - let's say the file has 2 million unique entries matching the table. Is the entire table loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded into memory as an RDD?

Solution

Your statement

DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();

will load the entire table into memory. You have not provided any filter for Phoenix to push down into HBase - and thus reduce the number of rows read.
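For example, applying a filter on the DataFrame before any action gives the connector a predicate it can try to push down to Phoenix/HBase. A minimal sketch reusing the df from the question (the column name "ID" and the literal values are assumptions, not from the original post):

// Hypothetical sketch: filter on a key column up front so the predicate can be
// pushed down to Phoenix/HBase instead of scanning all 500 million rows.
DataFrame filtered = df.filter(df.col("ID").isin("id-1", "id-2", "id-3"));
JavaRDD<Row> filteredRows = filtered.toJavaRDD();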

If you do a join to a non-HBase data source - e.g. a flat file - then all of the records from the HBase table would first need to be read in. The records not matching the secondary data source would not be kept in the new DataFrame - but the initial read would still have happened.
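For concreteness, a minimal sketch of such a join, reusing dbData from the question (the JavaSparkContext variable sc and the file path are hypothetical). The full scan behind dbData happens regardless of how few keys the file contributes:

// Illustrative only: build (ID, ID) pairs from a flat file (hypothetical path)
// and join with dbData. Every row of the HBase table is still read to produce
// dbData before the join discards the non-matching entries.
JavaPairRDD<String, String> fileData = sc.textFile("hdfs:///path/to/ids.txt")
        .mapToPair(line -> new Tuple2<String, String>(line, line));
JavaPairRDD<String, Tuple2<String, String>> joined = dbData.join(fileData);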

Update: A potential approach would be to pre-process the file - i.e. extract the IDs you want - and store the results into a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
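A rough sketch of what that server-side join could look like, assuming the file's IDs have been loaded into a small Phoenix table FILE_IDS and the result is written to MATCHED_ACCOUNTS (all table and column names here are hypothetical), using Phoenix's JDBC driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PhoenixJoinSketch {
    public static void main(String[] args) throws Exception {
        // Connect through the Phoenix JDBC driver; args[0] is the ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:" + args[0]);
             Statement stmt = conn.createStatement()) {
            // UPSERT ... SELECT executes the join inside HBase/Phoenix, so only
            // the matching rows are ever written out.
            stmt.executeUpdate(
                "UPSERT INTO MATCHED_ACCOUNTS (ID, NAME) " +
                "SELECT t.ID, t.NAME FROM ACCOUNTS t " +
                "JOIN FILE_IDS f ON t.ID = f.ID");
            conn.commit();
        }
    }
}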

The rationale for that approach is to move the computation to the data. The bulk of the data resides in HBase - so move the small data (the IDs from the file) there.

I am not directly familiar with Phoenix beyond the fact that it provides a SQL layer on top of HBase. Presumably it would then be capable of doing such a join and storing the result in a separate HBase table? That separate table could then be loaded into Spark to be used in your subsequent computations.
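If so, reading the (much smaller) result table back into Spark would follow the same pattern as the original read; a sketch reusing sqlContext and zkURL from the question, with the hypothetical MATCHED_ACCOUNTS table:

// Load only the pre-joined result rather than the 500-million-row source table.
Map<String, String> matchedInfoMap = new HashMap<String, String>();
matchedInfoMap.put("table", "MATCHED_ACCOUNTS");
matchedInfoMap.put("zkUrl", zkURL);
DataFrame matchedDf = sqlContext.read().format("org.apache.phoenix.spark").options(matchedInfoMap).load();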

This concludes the article on PHOENIX SPARK - Load Table as DataFrame. Hopefully the answer above is helpful; thanks for supporting IT屋!
