Aggregating all unique values of each column of a data frame


Question

I have a large data frame (1616610 rows, 255 columns) and I need to paste together the unique values of each column, grouped by a key.

For example:

> data = data.frame(a=c(1,1,1,2,2,3),
              b=c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
              c=c(12, 22, 22, 45, 67, 28), 
              d=c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))
> data
  a          b  c         d
1 1     apples 12    Monday
2 1    oranges 22    Monday
3 1     apples 22    Monday
4 2     apples 45   Tuesday
5 2     apples 67 Wednesday
6 3 grapefruit 28   Tuesday

What I need is to collect, for each key, the unique values in each of the 255 columns, and return a new data frame in which those unique values are joined with a comma separator. Like this:

  a               b      c                  d
1 1 apples, oranges 12, 22             Monday
2 2          apples 45, 67 Tuesday, Wednesday
3 3      grapefruit     28            Tuesday

I have tried using aggregate, like so:

output <- aggregate(data, by=list(data$a), paste, collapse=", ")

but for a data frame this size it has been too time-intensive (hours), and I often have to kill the process altogether. On top of that, this pastes together all values, not only the unique ones. Does anyone have any tips on:

1) how to improve the time of this aggregation for large data sets

2) then get the unique values of each field

BTW, this is my first post on SO, so thanks for your patience.

Recommended answer

Moved here from the comments:

library(data.table)

dt <- as.data.table(data)
dt[, lapply(.SD, function(x) toString(unique(x))), by = a]

This gives:

   a               b      c                  d
1: 1 apples, oranges 12, 22             Monday
2: 2          apples 45, 67 Tuesday, Wednesday
3: 3      grapefruit     28            Tuesday
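
For readers who cannot install data.table, the same idea (apply unique() before collapsing) can be sketched in base R by adjusting the aggregate call from the question. This is only an illustration on the toy data above: the helper name collapse_unique is made up for this example, and the approach has not been benchmarked, so it will likely still be much slower than the data.table version on 1.6 million rows.

```r
# Toy data from the question
data <- data.frame(
  a = c(1, 1, 1, 2, 2, 3),
  b = c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
  c = c(12, 22, 22, 45, 67, 28),
  d = c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday")
)

# Hypothetical helper: drop duplicates, then join with ", "
collapse_unique <- function(x) toString(unique(x))

# Aggregate every column except the key, grouping by the key column a
value_cols <- setdiff(names(data), "a")
out <- aggregate(data[value_cols], by = list(a = data$a), FUN = collapse_unique)
print(out)
```

Compared with the original attempt, the key column is excluded from the aggregated columns and named in the by list, so it appears once in the result instead of being pasted together as well. Note that every aggregated column comes back as character, as it does in the data.table answer.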
