Getting the most frequent element in a factor in R
Question
I have a set of strings in an R variable; when I check its class, it says it is a factor. For example:
mySet<-c("abc","abc","def","abc","def","efg","abc")
I want to get the string that occurs the most times in this set (i.e. "abc" in this case).
I understand one approach is to use hist(), but I am running into data type issues, and since I'm new to R I wasn't able to crack this one by myself.
Recommended answer
Depending on the size of your data and how often you need to do this, you might want to spend some time writing a more efficient function. Underlying table() is tabulate(), which is much faster, and can thus lead to a function like the following:
MaxTable <- function(InVec, mult = FALSE) {
  if (!is.factor(InVec)) InVec <- factor(InVec)
  A <- tabulate(InVec)
  if (isTRUE(mult)) {
    levels(InVec)[A == max(A)]
  } else {
    levels(InVec)[which.max(A)]
  }
}
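To illustrate what the function above relies on, here is a minimal sketch using the sample data from the question. It assumes the data really is a factor, as the question states, although the question's snippet builds a character vector; the names qSetF, counts, and tab are illustrative only. tabulate() returns bare integer counts in level order, so the most frequent level can be looked up by position.

```r
# Sample data from the question, converted to a factor (assumption: the
# question's own snippet creates a character vector, not a factor)
qSetF <- factor(c("abc", "abc", "def", "abc", "def", "efg", "abc"))

counts <- tabulate(qSetF)   # bare integer counts, one per level, in level order
tab <- table(qSetF)         # the same counts, plus names and table attributes

levels(qSetF)                       # "abc" "def" "efg"
counts                              # 4 2 1
levels(qSetF)[which.max(counts)]    # "abc"
```

Because tabulate() skips the bookkeeping that table() does, it is the cheaper call when only the counts are needed.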
MaxTable() is also designed to identify cases where several values are tied for the maximum count. Compare the following:
mySet <- c("A", "A", "A", "B", "B", "B", "C", "C")
## Your question indicates that you have factors,
## but your sample code is a character vector
mySetF <- factor(mySet) ## Just as an example
## @BrodieG's answer
fun1 <- function(InVec) {
  names(which.max(table(InVec)))
}
## @sgibb's answer
fun2 <- function(InVec) {
  m <- which.max(table(as.character(InVec)))
  as.character(InVec)[m]
}
fun1(mySet)
# [1] "A"
fun2(mySet)
# [1] "A"
MaxTable(mySet)
# [1] "A"
MaxTable(mySet, mult = TRUE)
# [1] "A" "B"
library(microbenchmark)
microbenchmark(fun1(mySet), fun2(mySet), MaxTable(mySet), MaxTable(mySetF))
# Unit: microseconds
# expr min lq median uq max neval
# fun1(mySet) 291.457 297.1845 302.2080 313.1235 3008.108 100
# fun2(mySet) 296.388 302.0775 311.3170 321.5260 1367.137 100
# MaxTable(mySet) 172.463 180.8755 184.8355 189.9700 1947.700 100
# MaxTable(mySetF) 34.510 38.1545 44.6045 46.6695 95.341 100
For small vectors, MaxTable() is more efficient. This is even more obvious with factor vectors. How about with bigger vectors?
set.seed(1)
medSet <- sample(c(LETTERS, letters), 1e5, TRUE)
medSetF <- factor(medSet)
fun1(medSet)
# [1] "E"
fun2(medSet) ### Wrong Answer!!!
# [1] "D"
MaxTable(medSet)
# [1] "E"
microbenchmark(fun1(medSet), MaxTable(medSet), MaxTable(medSetF))
# Unit: microseconds
# expr min lq median uq max neval
# fun1(medSet) 14222.846 14350.957 14484.4490 14600.490 34810.174 100
# MaxTable(medSet) 7787.761 7860.248 7917.3455 8019.068 9762.884 100
# MaxTable(medSetF) 501.733 529.257 570.0735 587.936 1469.994 100
I've dropped @sgibb's function from the benchmarks (it runs in about the same time as fun1()) since it returns the wrong answer.
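The wrong answer from fun2() appears to come from how the result of which.max() is used: it is a position within the sorted table, not a position within the original vector. A minimal sketch with a toy vector (hypothetical data, not taken from the benchmarks) shows the difference:

```r
# Toy vector (hypothetical data, chosen so the mismatch is visible)
x <- c("c", "c", "a", "c", "b")

tab <- table(x)        # counts ordered by sorted value: a = 1, b = 1, c = 3
m <- which.max(tab)    # position of the largest count within the table (3)

# m indexes the table, not x, so using it on x picks an unrelated element
wrong <- x[m]          # x[3], which is "a"

# The name attached to m identifies the value that was actually counted
right <- names(m)      # "c"
```

With the small mySet example, the element at that position happens to be the right one, which is why the problem only shows up on the larger vector.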
One final benchmark:
set.seed(3)
bigSet <- sample(c(LETTERS, letters), 1e7, TRUE)
bigSetF <- factor(bigSet)
microbenchmark(fun1(bigSet), MaxTable(bigSet), MaxTable(bigSetF), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1(bigSet) 1519.37503 1612.10290 1648.36473 1789.02965 1932.41073 10
# MaxTable(bigSet) 782.01856 791.86408 834.35764 894.60535 1019.28747 10
# MaxTable(bigSetF) 48.56459 48.76492 49.25444 49.93911 50.20404 10
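One caveat, offered as a sketch and not as part of the original answer: by default tabulate() sizes its result from the largest integer code present, so a factor with unused trailing levels gives a count vector shorter than levels(). With mult = TRUE, the shorter logical index is then recycled. The variant below, under the hypothetical name MaxTableAllLevels, passes nbins = nlevels(InVec) to keep counts and levels aligned; it has not been benchmarked here.

```r
# Hypothetical variant of MaxTable(): always count every level
MaxTableAllLevels <- function(InVec, mult = FALSE) {
  if (!is.factor(InVec)) InVec <- factor(InVec)
  # nbins = nlevels() yields one count per level, including unused ones
  A <- tabulate(InVec, nbins = nlevels(InVec))
  if (isTRUE(mult)) {
    levels(InVec)[A == max(A)]
  } else {
    levels(InVec)[which.max(A)]
  }
}

# A factor whose last two levels never occur
f <- factor(c("a", "a"), levels = c("a", "b", "c"))

length(tabulate(f))                       # 1: only codes up to the max seen
length(tabulate(f, nbins = nlevels(f)))   # 3: one count per level
MaxTableAllLevels(f, mult = TRUE)         # "a"
```

For factors created with factor() directly from the data, as in the benchmarks above, every level occurs at least once, so the two versions agree.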