Getting the most frequent element in a factor in R


Question

I have a set of strings in an R variable. When I check its class, it says it is a factor. For example:

mySet<-c("abc","abc","def","abc","def","efg","abc")

I want to get the string that occurs the most times in this set (i.e. "abc" in this case).

I understand one approach is to use hist(), but I am running into data type issues, and since I'm new to R I wasn't able to crack this one by myself.

Recommended answer

Depending on the size of your data and how often you need to do this, you might want to spend some time writing a more efficient function. table() is built on tabulate(), which is much faster, and that leads to a function like the following:

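MaxTable <- function(InVec, mult = FALSE) {
  # tabulate() counts integer codes, so coerce non-factors first
  if (!is.factor(InVec)) InVec <- factor(InVec)
  # nbins keeps one count per level, even for unused trailing levels
  A <- tabulate(InVec, nbins = nlevels(InVec))
  if (isTRUE(mult)) {
    levels(InVec)[A == max(A)]   # every level tied for the maximum
  } else {
    levels(InVec)[which.max(A)]  # first level with the maximum
  }
}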
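
To make the tabulate() step concrete, here is a minimal sketch using the vector from the question. It assumes nothing beyond base R; the variable names (f, g, counts, short, full) are made up for illustration. It shows that tabulate() counts the integer codes behind a factor, and why passing nbins matters when trailing levels are unused.

```r
# The question's data, stored as a factor (levels sort to "abc", "def", "efg")
f <- factor(c("abc", "abc", "def", "abc", "def", "efg", "abc"))

as.integer(f)                 # integer codes behind the factor: 1 1 2 1 2 3 1
counts <- tabulate(f, nbins = nlevels(f))
counts                        # one count per level: 4 2 1
levels(f)[which.max(counts)]  # "abc"

# Illustrative factor with an unused trailing level "z"
g <- factor(c("a", "a", "b"), levels = c("a", "b", "z"))
short <- tabulate(g)                      # stops at the largest code seen
full  <- tabulate(g, nbins = nlevels(g))  # one bin per level
length(short)                 # 2
length(full)                  # 3
```

Without nbins, the count vector can be shorter than levels(), and a logical index built from it would be recycled across the levels.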

MaxTable() is also designed to identify ties, where several values share the maximum count. Compare the following:

mySet <- c("A", "A", "A", "B", "B", "B", "C", "C")
## Your question indicates that you have factors,
##   but your sample code is a character vector
mySetF <- factor(mySet) ## Just as an example

## @BrodieG's answer
fun1 <- function(InVec) {
  names(which.max(table(InVec)))
}

## @sgibb's answer
fun2 <- function(InVec) {
  m <- which.max(table(as.character(InVec)))
  as.character(InVec)[m]
}

fun1(mySet)
# [1] "A"
fun2(mySet)
# [1] "A"
MaxTable(mySet)
# [1] "A"
MaxTable(mySet, mult = TRUE)
# [1] "A" "B"

library(microbenchmark)    
microbenchmark(fun1(mySet), fun2(mySet), MaxTable(mySet), MaxTable(mySetF))
# Unit: microseconds
#              expr     min       lq   median       uq      max neval
#       fun1(mySet) 291.457 297.1845 302.2080 313.1235 3008.108   100
#       fun2(mySet) 296.388 302.0775 311.3170 321.5260 1367.137   100
#   MaxTable(mySet) 172.463 180.8755 184.8355 189.9700 1947.700   100
#  MaxTable(mySetF)  34.510  38.1545  44.6045  46.6695   95.341   100

On small vectors, MaxTable() is faster, and the gap is even larger when the input is already a factor. How about bigger vectors?

set.seed(1)
medSet <- sample(c(LETTERS, letters), 1e5, TRUE)
medSetF <- factor(medSet)

fun1(medSet)
# [1] "E"
fun2(medSet) ### Wrong Answer!!!
# [1] "D"
MaxTable(medSet)
# [1] "E"

microbenchmark(fun1(medSet), MaxTable(medSet), MaxTable(medSetF))
# Unit: microseconds
#               expr       min        lq     median        uq       max neval
#       fun1(medSet) 14222.846 14350.957 14484.4490 14600.490 34810.174   100
#   MaxTable(medSet)  7787.761  7860.248  7917.3455  8019.068  9762.884   100
#  MaxTable(medSetF)   501.733   529.257   570.0735   587.936  1469.994   100

I've dropped @sgibb's function (fun2) from the benchmarks because it returns the wrong answer; it runs in about the same time as fun1().
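
As for why fun2() goes wrong: which.max() returns a position within the table, but fun2() then uses that position to index the original vector. The sketch below reproduces the problem on a tiny vector; fun2_fixed is a hypothetical name, and the repair (indexing the table's names instead) is in effect the same approach as fun1().

```r
x <- c("b", "a", "a", "a", "b")   # illustrative data; "a" is the mode
tab <- table(x)                   # counts: a = 3, b = 2
m <- which.max(tab)               # position 1 within the table (level "a")

x[m]            # "b": element 1 of the original vector, not the mode
names(tab)[m]   # "a": the table's own label at that position

# Hypothetical repair: look the position up in the table, not in the input
fun2_fixed <- function(InVec) {
  tab <- table(as.character(InVec))
  names(tab)[which.max(tab)]
}
fun2_fixed(x)   # "a"
```

The small mySet example hides the bug only because its first element happens to be the most frequent value.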

One last benchmark:

set.seed(3)
bigSet <- sample(c(LETTERS, letters), 1e7, TRUE)
bigSetF <- factor(bigSet)
microbenchmark(fun1(bigSet), MaxTable(bigSet), MaxTable(bigSetF), times = 10)
# Unit: milliseconds
#               expr        min         lq     median         uq        max neval
#       fun1(bigSet) 1519.37503 1612.10290 1648.36473 1789.02965 1932.41073    10
#   MaxTable(bigSet)  782.01856  791.86408  834.35764  894.60535 1019.28747    10
#  MaxTable(bigSetF)   48.56459   48.76492   49.25444   49.93911   50.20404    10
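
The only difference between MaxTable(bigSet) and MaxTable(bigSetF) is the factor() conversion, so the timings above suggest that conversion is where most of the time goes for character input. A rough sketch for checking this on your own machine follows. It uses a smaller vector (1e6, an arbitrary choice) and made-up variable names, and no timings are shown because they vary by system.

```r
set.seed(3)
x <- sample(c(LETTERS, letters), 1e6, TRUE)

# Time the conversion and the counting step separately
t_convert <- system.time(xF <- factor(x))[["elapsed"]]
t_count   <- system.time(
  counts <- tabulate(xF, nbins = nlevels(xF))
)[["elapsed"]]

c(convert = t_convert, count = t_count)   # elapsed seconds for each step
levels(xF)[which.max(counts)]             # most frequent value
```

If the same data is queried repeatedly, converting to a factor once and reusing it avoids paying the conversion cost on every call.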
