Append a column to Data Frame in Apache Spark 1.3

Question

Is it possible, and what would be the most efficient and neat method, to add a column to a Data Frame?

More specifically, the column may serve as row IDs for the existing Data Frame.

In a simplified case, reading from a file and not tokenizing it, I can think of something like the following (in Scala), but it completes with errors (at line 3), and anyway it doesn't look like the best route possible:

var dataDF = sc.textFile("path/file").toDF()                       // line 1: read the file into a one-column DataFrame
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")   // line 2: a separate DataFrame of row IDs
dataDF = dataDF.withColumn("ID", rowDF("ID"))                      // line 3: fails -- withColumn cannot take a column from another DataFrame

Solution

It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.

So the original task was to append a column of row identifiers (basically, a sequence 1 to numRows) to any given data frame, so that row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:

import org.apache.spark.sql.Row

val delimiter = ","   // assumed delimiter; pick one that cannot occur in the data

val rowRDD = sc.textFile(file).   // textFile lives on SparkContext, not SQLContext
  zipWithIndex().                 // pair each line with its position
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))  // RDD[Row]: the index followed by the data fields
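The snippet above yields an RDD[Row] rather than a Data Frame. A minimal sketch of the remaining step, assuming the same one-column text file (so both fields are strings) and illustrative column names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// split() produces strings, so both columns are typed as strings here; cast them later if needed.
val schema = StructType(Seq(
  StructField("rowId", StringType, nullable = false),
  StructField("line", StringType, nullable = false)))

val dataDF = sqlContext.createDataFrame(rowRDD, schema)   // Spark 1.3 API: createDataFrame(RDD[Row], StructType)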

Regarding the general case of appending any column to any data frame:

The "closest" to this functionality in Spark API are withColumn and withColumnRenamed. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:

val df = df1.withColumn("newCol", df1("col") + 1) // -- OK: the column comes from df1 itself
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL: the column belongs to another DataFrame

So unless you can manage to transform a column of the existing data frame into the shape you need, you cannot use withColumn or withColumnRenamed to append arbitrary columns (standalone columns or columns of other data frames).

As was commented above, a workaround may be to use a join - it would be pretty messy, although possible - attaching unique keys (like above, with zipWithIndex) to both data frames or columns might work; see the sketch below. Although the efficiency is ...
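A minimal sketch of that join workaround, staying within the Spark 1.3 API; withRowId is a hypothetical helper, not part of Spark:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: tag every row with a stable "rowId" column via zipWithIndex.
def withRowId(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("rowId", LongType, nullable = false))
  val rdd = df.rdd.zipWithIndex().map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  df.sqlContext.createDataFrame(rdd, schema)
}

// Rename the key on one side so the join condition is unambiguous.
val left = withRowId(df1)
val right = withRowId(df2).withColumnRenamed("rowId", "rowId2")
val joined = left.join(right, left("rowId") === right("rowId2"))

Spark 1.3 has no DataFrame.drop, so a final select would be needed to project the key columns away.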

It's clear that appending a column to a data frame is not an easy operation in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
