How can I convert groupedData into a DataFrame in R
Problem description
Suppose I have the following data frame:
AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12
I want to group it by AccountId and then add another column named date_diff, containing the difference in CloseDate between the current row and the previous row. Note that I want date_diff to be calculated only across rows with the same AccountId, so I need to group the data before adding the column.
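For reference, the intended computation is straightforward on a plain local data frame, where dplyr's `group_by` really does group rows before `mutate` runs (a sketch using the sample data; this does not use Spark at all):

```r
library(dplyr)

# Sample data from the question, as a local data frame
df <- data.frame(
  AccountId = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 3, 4, 1),
  CloseDate = as.Date(c("2015-05-07", "2015-05-09", "2015-05-01", "2015-05-07",
                        "2015-05-09", "2015-05-12", "2015-05-12", "2015-05-01",
                        "2015-05-01", "2015-05-02", "2015-05-17", "2015-05-12"))
)

# Order within each account, then take the difference to the previous row
result <- df %>%
  arrange(AccountId, CloseDate) %>%
  group_by(AccountId) %>%
  mutate(date_diff = as.numeric(CloseDate - lag(CloseDate))) %>%
  ungroup()
```

The first row of each account gets `NA`, since it has no previous row. The question is how to get the same behaviour in SparkR, where `group_by` works differently.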
Below is the R code I am using:
df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
df$CloseDate <- to_date(df$CloseDate)
groupedData <- SparkR::group_by(df, df$AccountId)
SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate),(SparkR::lag(CloseDate,1)))))
To add another column I am using mutate. But since group_by returns GroupedData, I am not able to use mutate here. I am getting the error below:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’
So how can I convert GroupedData into a DataFrame so that I can add columns using mutate?
Recommended answer
What you want is not possible to achieve using group_by. As already explained quite a few times on SO:
- Using groupBy in Spark and getting back to a DataFrame
- How to do custom operations on GroupedData in Spark?
- DataFrame groupBy behaviour/optimization
group_by on a DataFrame doesn't physically group the data. Moreover, the order of operations after applying group_by is nondeterministic.
To achieve the desired output you'll have to use window functions and provide an explicit ordering:
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L,
5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02",
"2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")),
.Names = c("AccountId", "CloseDate"),
class = "data.frame", row.names = c(NA, -12L))
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")
query <- "SELECT *, LAG(CloseDate, 1) OVER (
PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"
dfWithLag <- sql(hiveContext, query)
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
head()
## AccountId CloseDate DateLag diff
## 1 1 2015-05-07 <NA> NA
## 2 1 2015-05-09 2015-05-07 2
## 3 1 2015-05-12 2015-05-09 3
## 4 1 2015-05-12 2015-05-12 0
## 5 2 2015-05-09 <NA> NA
## 6 2 2015-05-12 2015-05-09 3