How can I convert GroupedData into a DataFrame in R


Problem description

Consider that I have the following data frame:

AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12

I want to group it based on AccountId and then add another column named date_diff, which will contain the difference in CloseDate between the current row and the previous row. Please note that I want this date_diff to be calculated only for rows having the same AccountId, so I need to group the data before adding the column.
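
For illustration only, here is a sketch of the intended per-account computation in plain (non-Spark) R using dplyr; the local data frame local_df and the local path "sample.csv" are assumptions for this sketch:

  library(dplyr)

  # Assumed local copy of the CSV shown above
  local_df <- read.csv("sample.csv", stringsAsFactors = FALSE)
  local_df$CloseDate <- as.Date(local_df$CloseDate)

  local_df %>%
    arrange(AccountId, CloseDate) %>%   # explicit ordering within each account
    group_by(AccountId) %>%
    mutate(date_diff = as.numeric(CloseDate - lag(CloseDate)))  # NA on each account's first row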

Below is the R code that I am using:

  # Read the CSV with spark-csv and parse CloseDate as a date
  df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header = "true")
  df$CloseDate <- to_date(df$CloseDate)
  # Group by AccountId, then try to add the lagged difference -- this call fails
  groupedData <- SparkR::group_by(df, df$AccountId)
  SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate), (SparkR::lag(CloseDate, 1)))))

To add another column I am using mutate. But since group_by returns a GroupedData object, I am not able to use mutate here. I am getting the error below:

 Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’

So how can I convert GroupedData into a DataFrame so that I can add columns using mutate?

Answer

What you want cannot be achieved using group_by. As already explained quite a few times on SO:

  • Using groupBy in Spark and getting back to a DataFrame
  • How to do custom operations on GroupedData in Spark?
  • DataFrame groupBy behaviour/optimization

group_by on a DataFrame doesn't physically group the data. Moreover, the order of operations after applying group_by is nondeterministic.
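
For context, a GroupedData in SparkR only supports collapsing back into a DataFrame via aggregation; here is a minimal sketch (the aggregate column name total is illustrative):

# A GroupedData can only be turned back into a DataFrame by aggregating,
# e.g. counting rows per AccountId; row-level mutate is not available on it
grouped <- SparkR::group_by(df, df$AccountId)
counts <- SparkR::agg(grouped, total = SparkR::count(df$AccountId))
head(counts)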

To achieve the desired output, you'll have to use window functions and provide an explicit ordering:

# Recreate the sample data as a local R data frame
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 
  3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L, 
  5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02", 
  "2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")), 
  .Names = c("AccountId", "CloseDate"),
  class = "data.frame", row.names = c(NA, -12L))

# Window functions in Spark 1.x require a HiveContext
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")

query <- "SELECT *, LAG(CloseDate, 1) OVER (
  PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"

dfWithLag <- sql(hiveContext, query)

withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
  head()

##   AccountId  CloseDate    DateLag diff
## 1         1 2015-05-07       <NA>   NA
## 2         1 2015-05-09 2015-05-07    2
## 3         1 2015-05-12 2015-05-09    3
## 4         1 2015-05-12 2015-05-12    0
## 5         2 2015-05-09       <NA>   NA
## 6         2 2015-05-12 2015-05-09    3
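
As a side note, SparkR 2.0+ also exposes a WindowSpec API, so the same lag can be expressed without raw SQL. This is a sketch assuming such a version is available, reusing sdf and the DateLag name from the query above:

# Equivalent lag via SparkR's WindowSpec API (available from Spark 2.0)
ws <- orderBy(windowPartitionBy("AccountId"), "CloseDate")
dfWithLag2 <- withColumn(sdf, "DateLag", over(lag(sdf$CloseDate, 1), ws))
withColumn(dfWithLag2, "diff", datediff(dfWithLag2$CloseDate, dfWithLag2$DateLag)) %>%
  head()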

