R - Fast Mode Function for use in data.table[,lapply(.SD,Mode),by=.()]

Question

I'm summarizing data in a data.table by group, and I need a single value of each variable per group. I want that value to be the group's mode, because a typical group has 8 rows, of which about 2 share one value and the other 6 or so share another.

Here is a simplified example:

key1 2
key1 2
key1 2
key1 8
key1 2
key1 2
key1 2
key1 8

I want this:

key1 2

I had trouble using the standard mode function provided by base R (mode() reports an object's storage mode, not its most frequent value), so I used the solution from "Most frequent value (mode) by group":

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
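
To make clear what this helper does, here is a step-by-step walk-through on the simplified example above. The vector x is just the eight sample values typed in by hand; nothing else is assumed.

```r
# The eight values of the simplified example
x <- c(2, 2, 2, 8, 2, 2, 2, 8)

ux <- unique(x)          # distinct values, in order of first appearance
idx <- match(x, ux)      # position of each element of x within ux
counts <- tabulate(idx)  # how many times each distinct value occurs

# The most frequent value; which.max() returns the first maximum,
# so ties go to the value that appears first in x
most_frequent <- ux[which.max(counts)]
print(most_frequent)
```

Each call allocates a few temporary vectors, which is cheap once but adds up when the function is invoked for every column of every group.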

It worked great on my small test data set, but when I run it on my actual data set (22 million rows) it just runs and runs. All my other, similar data.table operations run very fast, but none of them uses a user-defined function (UDF). This is the structure of my data.table query:

ModeCharacterColumns <- ExposureHistory[, lapply(.SD, Mode), by = .(Key1 = Key1, Key2 = Key2, ..., key7 = key7, key8 = key8), .SDcols = c('col1', 'col2', 'col3', ..., 'col53')]

So I'm guessing my UDF is what is slowing things down. Does anyone have suggestions for accomplishing the same goal much more quickly?

Thanks, everyone!

A better representation of the data:

DT <- fread("key1A key2A key3A key4A 2 2 4 s
             key1A key2A key3A key4A 2 2 4 s  
             key1A key2A key3A key4A 8 8 8 t
             key1A key2A key3A key4A 2 2 4 s
             key1B key2B key3B key4B 6 6 6 v
             key1B key2B key3B key4B 2 2 5 t
             key1B key2B key3B key4B 2 2 5 v
             key1B key2B key3B key4B 2 2 5 v")

And the desired result:

result <- fread("key1A key2A key3A key4A 2 2 4 s
                 key1B key2B key3B key4B 2 2 5 v")

Recommended answer

Try using data.table to tabulate the data:

DT <- fread("key1 8
             key1 2
             key1 2
             key1 8
             key1 2
             key1 2
             key1 2
             key1 8")

setkeyv(
  DT[, .N, by = .(V1, V2)], #tabulate
  c("V1", "N") #sort by N
   )[, .(Mode = V2[.N]), by = V1] #most frequent value by V1
#     V1 Mode
#1: key1    2

You need to consider tie-breaking carefully. I might actually use a for loop to apply this to more value columns, but you'd need to provide a representative reproducible example if you want me to try that.
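
To illustrate the tie-breaking point, here is a minimal sketch on made-up data. The sample values and the rule "the smallest value wins a tie" are assumptions chosen for illustration, not part of the original answer. Adding the value column to the sort makes the outcome independent of the input row order.

```r
library(data.table)

# Hypothetical data: in group key1 the values 2 and 8 both occur twice
DT <- data.table(V1 = c("key1", "key1", "key1", "key1", "key2"),
                 V2 = c(8L, 2L, 2L, 8L, 5L))

# Tabulate each (group, value) pair
counts <- DT[, .N, by = .(V1, V2)]

# Sort by group, then descending count, then ascending value,
# and keep the first value per group (assumed rule: smallest value wins ties)
modes <- counts[order(V1, -N, V2), .(Mode = V2[1L]), by = V1]
print(modes)
```

A different rule (for example, largest value or first appearance) only needs a different sort order in the last step.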

Frank provides one option for doing this for several value columns in a comment:

DT[, lapply(.SD, function(x) setDT(list(x = x))[, .N, by=x][order(-N)][1L, x]), by=V1]

However, I believe this copies every value column, which might slow it down too much.
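
One possible way to handle several value columns without calling a function per group is to tabulate one column at a time and merge the per-column results. The sketch below is an assumption about how such a loop could look, not the answerer's code. It rebuilds the sample data from the question by hand with the default column names V1 to V8, and the names keys, vals, and mode_tables are hypothetical. Whether it is fast enough on 22 million rows would need to be measured.

```r
library(data.table)

# Sample data from the question, rebuilt by hand (V1-V4 keys, V5-V8 values)
DT <- data.table(
  V1 = rep(c("key1A", "key1B"), each = 4),
  V2 = rep(c("key2A", "key2B"), each = 4),
  V3 = rep(c("key3A", "key3B"), each = 4),
  V4 = rep(c("key4A", "key4B"), each = 4),
  V5 = c(2L, 2L, 8L, 2L, 6L, 2L, 2L, 2L),
  V6 = c(2L, 2L, 8L, 2L, 6L, 2L, 2L, 2L),
  V7 = c(4L, 4L, 8L, 4L, 6L, 5L, 5L, 5L),
  V8 = c("s", "s", "t", "s", "v", "t", "v", "v")
)

keys <- paste0("V", 1:4)  # grouping columns
vals <- paste0("V", 5:8)  # columns whose mode is wanted

# For each value column: count (keys, value) pairs, sort so the most
# frequent value comes first within each key group (ties: smallest value),
# then keep the first row per key group
mode_tables <- lapply(vals, function(v) {
  counts <- DT[, .N, by = c(keys, v)]
  setorderv(counts, c(keys, "N", v),
            order = c(rep(1L, length(keys)), -1L, 1L))
  unique(counts, by = keys)[, c(keys, v), with = FALSE]
})

# Combine the per-column modes into one table, one row per key group
result <- Reduce(function(a, b) merge(a, b, by = keys), mode_tables)
print(result)
```

On the sample data this yields one row per key combination, matching the desired result in the question.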
