Spark DataFrame created from JavaRDD<Row> copies all columns' data into the first column


Question

Hi, I have a DataFrame that I need to convert into a JavaRDD and then back into a DataFrame. I have the following code:

DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
// I do an order by on sourceFrame above, then convert it into a JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        if (row != null) {
            // updated row by creating a new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
// now I convert the JavaRDD<Row> above into a DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);

sourceFrame and modifiedFrame have the same schema. When I call sourceFrame.show(), the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show(), all the column values are merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:

_col1    _col2    _col3
 ABC       10      DEF
 GHI       20      JKL

When I print modifiedFrame, which I converted from the JavaRDD, it shows the following:

_col1        _col2      _col3
ABC,10,DEF
GHI,20,JKL

As shown above, _col1 holds all the values, and _col2 and _col3 are empty. I don't know what I am doing wrong. Please guide me, I am new to Spark. Thanks in advance.

Recommended answer

As I mentioned in a comment on the question, this probably happens because the list is passed as a single parameter:

return RowFactory.create(updateRow);

I looked at the Apache Spark docs and source code. In the "Programmatically Specifying the Schema" example (http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema), the values are passed one by one, one argument per column. From a quick read of the RowFactory.java class (https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java) and the GenericRow class, a single list argument is not expanded into separate columns. So try passing the value of each column as its own argument:

return RowFactory.create(updateRow.get(0),updateRow.get(1),updateRow.get(2)); // List Example
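
To illustrate why a list passed as a single argument ends up in one column, here is a minimal sketch in plain Java, without Spark. The method create below is a hypothetical stand-in that only mirrors the varargs shape of RowFactory.create(Object... values); it is not Spark code, and the sample values are taken from the table in the question.

```java
import java.util.Arrays;
import java.util.List;

public class VarargsDemo {
    // Hypothetical stand-in with the same varargs shape as
    // RowFactory.create(Object... values). It returns the values array
    // so we can count how many fields a row would receive.
    public static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        List<Object> updateRow = Arrays.<Object>asList("ABC", 10, "DEF");

        // The whole list counts as ONE argument, so there is a single field.
        Object[] merged = create(updateRow);
        System.out.println("fields when passing the list: " + merged.length);

        // One argument per column gives one field per column.
        Object[] separate = create(updateRow.get(0), updateRow.get(1), updateRow.get(2));
        System.out.println("fields when passing each value: " + separate.length);
    }
}
```

In the first call the compiler wraps the list in a one-element Object[], which matches the symptom in the question: every value lands in _col1.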

Alternatively, you can convert your list to an array and pass the array as the parameter:

YourObject[] updatedRowArray= new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
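
The array approach can be sketched the same way in plain Java, assuming updateRow is a java.util.List. As before, create is a hypothetical stand-in for a method declared with Object... values, and fromList is an illustrative helper name, not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;

public class ListToArrayDemo {
    // Hypothetical stand-in for a method declared as create(Object... values).
    public static Object[] create(Object... values) {
        return values;
    }

    // Illustrative helper: List.toArray() returns an Object[], which is
    // passed as the varargs array itself, so each element is its own field.
    public static Object[] fromList(List<?> columns) {
        return create(columns.toArray());
    }

    public static void main(String[] args) {
        List<Object> updateRow = new ArrayList<>();
        updateRow.add("GHI");
        updateRow.add(20);
        updateRow.add("JKL");

        Object[] fields = fromList(updateRow);
        System.out.println("fields after converting to an array: " + fields.length);
    }
}
```

Using toArray() without arguments avoids having to pick an element type such as YourObject when the columns hold mixed types (strings and numbers in the question's table).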

By the way, the RowFactory.create() method creates Row objects. From the Apache Spark documentation on the Row object and the RowFactory.create() method:

Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null.

To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.

A Row object can be constructed by providing field values. Example:

import org.apache.spark.sql._

// Create a Row from values.
Row(value1, value2, value3, ...)

// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))

According to the documentation, you can also apply whatever logic you need to separate the row's columns while creating the Row objects. But I think converting the list to an array and passing the array as the parameter will work for you (I couldn't try it myself, so please post your feedback, thanks).
