Get most frequent string from a data frame column


Problem description

I need to return the n most frequent occurrences of a string, using a multi-row data frame as the input. All the values are in the same column, called "MissingDates".

Here is sample data; in total there are about 5000 rows:

Symbol Count  MissingDates     
AD  27  1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11-29, 1996-12-09, 1996-12-20, 1996-12-23, 1996-12-26, 1996-12-27, 1997-01-02, 1997-05-02, 1997-09-10, 1998-01-02, 1998-04-16, 1998-12-08, 1999-12-27, 1999-12-31, 2001-09-12, 2003-08-06, 2003-10-13
BP  14  1995-08-09, 1995-08-15, 1995-12-26, 1996-01-02, 1996-09-06, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-09-12, 2002-12-24, 2003-08-06, 2003-10-13
C   3   1999-12-31, 2001-12-24, 2002-12-24
CC  285 1994-05-18, 1994-05-19, 1994-05-20, 1994-05-23, 1994-05-24, 1994-05-25, 1994-05-26, 1994-05-27, 1994-05-31, 1994-06-01, 1994-06-02, 1994-06-03, 1994-06-06, 1994-06-07, 1994-06-08, 1994-06-09, 1994-06-10, 1994-06-13, 1994-06-14, 1994-06-15, 1994-06-16, 1994-06-17, 1994-06-20, 1994-06-21, 1994-06-23, 1994-06-24, 1994-06-27, 1994-06-28, 1994-06-29, 1994-06-30, 1994-07-01, 1994-07-06, 1994-07-14, 1994-07-15, 1994-07-18, 1994-07-19, 1994-07-21, 1994-07-25, 1994-07-27, 1994-07-28, 1994-08-03, 1994-08-04, 1994-08-08, 1994-08-09, 1994-08-10, 1994-08-11, 1994-08-12, 1994-08-15, 1994-08-17, 1994-08-18, 1994-08-19, 1994-08-22, 1994-08-23, 1994-08-24, 1994-08-25, 1994-08-29, 1994-08-31, 1994-09-01, 1994-09-02, 1994-09-06, 1994-09-07, 1994-09-08, 1994-09-09, 1994-09-12, 1994-09-13, 1994-09-15, 1994-09-16, 1994-09-19, 1994-09-20, 1994-09-21, 1994-09-22, 1994-09-23, 1994-09-27, 1994-09-28, 1994-09-29, 1994-09-30, 1994-10-03, 1994-10-04, 1994-10-06, 1994-10-14, 1994-10-18, 1994-10-19, 1994-10-25, 1994-10-26, 1994-10-27, 1994-10-28, 1994-10-31, 1994-11-01, 1994-11-09, 1994-11-10, 1994-11-11, 1994-11-16, 1994-11-17, 1994-11-25, 1994-11-28, 1994-12-01, 1994-12-02, 1994-12-06, 1994-12-07, 1994-12-08, 1994-12-09, 1994-12-12, 1994-12-13, 1994-12-14, 1994-12-15, 1994-12-16, 1994-12-23, 1994-12-27, 1994-12-29, 1994-12-30, 1995-01-03, 1995-01-05, 1995-01-09, 1995-01-11, 1995-01-13, 1995-01-16, 1995-01-17, 1995-01-18, 1995-01-19, 1995-01-20, 1995-01-24, 1995-01-25, 1995-02-13, 1995-02-17, 1995-05-01, 1995-07-03, 1995-11-24, 1995-12-26, 1996-01-08, 1996-01-09, 1996-07-05, 1996-11-29, 1996-12-26, 1997-11-28, 1997-12-26, 1998-01-02, 1998-11-27, 1999-06-17, 1999-06-18, 1999-06-21, 1999-06-22, 1999-06-23, 1999-06-24, 1999-06-25, 1999-06-28, 1999-06-29, 1999-06-30, 1999-07-01, 1999-07-02, 1999-07-06, 1999-07-07, 1999-07-08, 1999-07-09, 1999-07-12, 1999-07-13, 1999-07-14, 1999-07-15, 1999-07-16, 1999-07-19, 1999-07-20, 1999-07-21, 1999-07-22, 1999-07-23, 1999-07-26, 1999-07-27, 1999-07-28, 
1999-07-29, 1999-07-30, 1999-08-02, 1999-08-03, 1999-08-04, 1999-08-05, 1999-08-06, 1999-08-09, 1999-08-10, 1999-08-11, 1999-08-12, 1999-08-13, 1999-08-16, 1999-08-17, 1999-08-18, 1999-08-19, 1999-08-20, 1999-08-23, 1999-08-24, 1999-08-25, 1999-08-26, 1999-08-27, 1999-08-30, 1999-08-31, 1999-09-01, 1999-09-02, 1999-09-03, 1999-09-07, 1999-09-08, 1999-09-09, 1999-09-10, 1999-09-13, 1999-09-14, 1999-09-15, 1999-09-16, 1999-09-17, 1999-09-20, 1999-09-21, 1999-09-22, 1999-09-23, 1999-09-24, 1999-09-27, 1999-09-28, 1999-09-29, 1999-09-30, 1999-10-01, 1999-10-04, 1999-10-05, 1999-10-06, 1999-10-07, 1999-10-08, 1999-10-11, 1999-10-12, 1999-10-13, 1999-10-14, 1999-10-15, 1999-10-18, 1999-10-19, 1999-10-20, 1999-10-21, 1999-10-22, 1999-10-25, 1999-10-26, 1999-10-27, 1999-10-28, 1999-10-29, 1999-11-01, 1999-11-02, 1999-11-03, 1999-11-04, 1999-11-05, 1999-11-08, 1999-11-09, 1999-11-10, 1999-11-11, 1999-11-12, 1999-11-15, 1999-11-16, 1999-11-17, 1999-11-18, 1999-11-19, 1999-11-22, 1999-11-23, 1999-11-24, 1999-11-26, 1999-11-29, 1999-11-30, 1999-12-01, 1999-12-02, 1999-12-03, 1999-12-06, 1999-12-07, 1999-12-08, 1999-12-09, 1999-12-10, 1999-12-13, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-13, 2001-09-14, 2001-11-23, 2001-12-24, 2001-12-26, 2001-12-31, 2002-07-05, 2002-11-29, 2002-12-26, 2003-02-18, 2003-11-28, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24, 2011-01-03
CD  14  1995-08-09, 1995-12-26, 1996-01-02, 1996-06-11, 1996-06-20, 1996-09-09, 1996-09-11, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-01-02, 2001-09-12
CT  154 1995-11-24, 1996-01-08, 1996-07-05, 1996-11-29, 1996-12-24, 1997-11-28, 1997-12-26, 1998-11-27, 1999-11-26, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-11, 2001-09-12, 2001-09-13, 2001-09-14, 2001-11-12, 2001-11-23, 2001-12-24, 2001-12-31, 2002-05-21, 2002-05-22, 2002-05-23, 2002-05-24, 2002-05-28, 2002-05-29, 2002-05-30, 2002-05-31, 2002-06-03, 2002-06-04, 2002-06-05, 2002-06-06, 2002-06-07, 2002-06-10, 2002-06-11, 2002-06-12, 2002-06-13, 2002-06-14, 2002-06-17, 2002-06-18, 2002-06-19, 2002-06-20, 2002-06-21, 2002-06-24, 2002-06-25, 2002-06-26, 2002-06-27, 2002-06-28, 2002-07-01, 2002-07-02, 2002-07-03, 2002-07-05, 2002-07-08, 2002-07-09, 2002-07-10, 2002-07-11, 2002-07-12, 2002-07-15, 2002-07-16, 2002-07-17, 2002-07-18, 2002-07-19, 2002-07-22, 2002-07-23, 2002-07-24, 2002-07-25, 2002-07-26, 2002-07-29, 2002-07-30, 2002-07-31, 2002-08-01, 2002-08-02, 2002-08-05, 2002-08-06, 2002-08-07, 2002-08-08, 2002-08-09, 2002-08-12, 2002-08-13, 2002-08-14, 2002-08-15, 2002-08-16, 2002-08-19, 2002-08-20, 2002-08-21, 2002-08-22, 2002-08-23, 2002-08-26, 2002-08-27, 2002-08-28, 2002-08-29, 2002-08-30, 2002-09-03, 2002-09-04, 2002-09-05, 2002-09-06, 2002-09-09, 2002-09-10, 2002-09-11, 2002-09-12, 2002-09-13, 2002-09-16, 2002-09-17, 2002-09-18, 2002-09-19, 2002-09-20, 2002-09-23, 2002-09-24, 2002-09-25, 2002-09-26, 2002-09-27, 2002-09-30, 2002-10-01, 2002-10-02, 2002-10-03, 2002-10-04, 2002-10-07, 2002-10-08, 2002-10-09, 2002-10-10, 2002-10-11, 2002-10-14, 2002-10-15, 2002-10-16, 2002-10-17, 2002-10-18, 2002-10-21, 2002-10-22, 2002-10-23, 2002-10-24, 2002-10-25, 2002-10-28, 2002-10-29, 2002-10-30, 2002-10-31, 2002-11-01, 2002-11-04, 2002-11-05, 2002-11-06, 2002-11-07, 2002-11-29, 2002-12-24, 2003-02-18, 2003-11-28, 2003-12-26, 2004-01-02, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24

So the function would take an argument n and return the n most frequent occurrences of the dates above from the data.frame.

I looked at which.max but haven't been able to figure out how to apply it to multiple rows (an entire data frame column), or how to get more than a single date (n of them) as output.

If the code would be a great deal simpler with only a single output value, that is acceptable as a starting point for me to work from. Any pointers are appreciated.

Here is a pastebin, as I'm having trouble due to the length of the strings: http://pastebin.com/B1YPicC8


> str(gaps)
    'data.frame':   5560 obs. of  3 variables:
     $ Symbol      : Factor w/ 5560 levels "@AD#","@BP#",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ Count       : int  27 14 3 285 14 154 540 11 3 11 ...
     $ MissingDates: Factor w/ 3568 levels "1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10 ...


Recommended answer

It seems you need something like the following:

Function

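# x: factor or character vector of comma-separated dates; n: how many dates to return
freqfunc <- function(x, n){
  # split each row on ", ", pool all dates, count them, keep the n largest counts
  tail(sort(table(unlist(strsplit(as.character(x), ", ")))), n)
}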
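
To see what each step of that one-liner does, here is a minimal sketch that runs the same pipeline one call at a time. The vector `x` is a made-up stand-in for `gaps$MissingDates` (the values are hypothetical, not taken from the real data), and the intermediate names `parts`, `dates`, `counts` and `top2` exist only for illustration.

```r
# Toy stand-in for gaps$MissingDates (hypothetical values, not the real data)
x <- factor(c("1995-12-26, 1996-01-02",
              "1995-12-26, 1996-01-02, 2001-09-12",
              "1995-12-26"))

parts  <- strsplit(as.character(x), ", ")  # list: one character vector per row
dates  <- unlist(parts)                    # one flat vector holding every date
counts <- table(dates)                     # occurrences of each distinct date
top2   <- tail(sort(counts), 2)            # sort ascending, keep the last 2

top2
```

Because `sort()` is ascending by default, the most frequent date ends up last in the result rather than first.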

Testing on the data set

freqfunc(gaps$MissingDates, 5) # Five most frequent dates

## 1996-12-26 1997-12-26 1998-01-02 1999-12-31 2001-09-12 
##          4          4          4          4          4 
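
Note that the five dates shown are tied at a count of 4, so which of several tied dates make the cut depends on the sort order. If you would rather see the most frequent date first, or only need a single value as a starting point, a variant along these lines may help. This is only a sketch: `freqfunc_desc` is a hypothetical name, the toy input is invented, and splitting on "," followed by `trimws()` is an assumption made here to tolerate irregular spacing around the commas.

```r
# Hypothetical variant of freqfunc: most frequent first, whitespace trimmed
freqfunc_desc <- function(x, n) {
  # split on bare commas, then strip stray spaces around each date
  dates <- trimws(unlist(strsplit(as.character(x), ",")))
  # sort counts in descending order and keep the first n
  head(sort(table(dates), decreasing = TRUE), n)
}

# Toy input with inconsistent spacing (invented for illustration)
x <- c("1995-12-26, 1996-01-02",
       "1995-12-26,1996-01-02 , 2001-09-12",
       "1995-12-26")

res <- freqfunc_desc(x, 2)
res

# Single most frequent date, as a character string
most_frequent <- names(res)[1]
most_frequent
```

With `n = 1` the function returns just the top entry, which covers the single-output case mentioned in the question.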
