Get most frequent string from a data frame column


Problem description

I need to return the n most frequent occurrences of a string, using a multi-row data frame as the input. All the values are in the same column, called "MissingDates".

Here is sample data, in total there are about 5000 rows:

Symbol Count  MissingDates     
AD  27  1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11-29, 1996-12-09, 1996-12-20, 1996-12-23, 1996-12-26, 1996-12-27, 1997-01-02, 1997-05-02, 1997-09-10, 1998-01-02, 1998-04-16, 1998-12-08, 1999-12-27, 1999-12-31, 2001-09-12, 2003-08-06, 2003-10-13
BP  14  1995-08-09, 1995-08-15, 1995-12-26, 1996-01-02, 1996-09-06, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-09-12, 2002-12-24, 2003-08-06, 2003-10-13
C   3   1999-12-31, 2001-12-24, 2002-12-24
CC  285 1994-05-18, 1994-05-19, 1994-05-20, 1994-05-23, 1994-05-24, 1994-05-25, 1994-05-26, 1994-05-27, 1994-05-31, 1994-06-01, 1994-06-02, 1994-06-03, 1994-06-06, 1994-06-07, 1994-06-08, 1994-06-09, 1994-06-10, 1994-06-13, 1994-06-14, 1994-06-15, 1994-06-16, 1994-06-17, 1994-06-20, 1994-06-21, 1994-06-23, 1994-06-24, 1994-06-27, 1994-06-28, 1994-06-29, 1994-06-30, 1994-07-01, 1994-07-06, 1994-07-14, 1994-07-15, 1994-07-18, 1994-07-19, 1994-07-21, 1994-07-25, 1994-07-27, 1994-07-28, 1994-08-03, 1994-08-04, 1994-08-08, 1994-08-09, 1994-08-10, 1994-08-11, 1994-08-12, 1994-08-15, 1994-08-17, 1994-08-18, 1994-08-19, 1994-08-22, 1994-08-23, 1994-08-24, 1994-08-25, 1994-08-29, 1994-08-31, 1994-09-01, 1994-09-02, 1994-09-06, 1994-09-07, 1994-09-08, 1994-09-09, 1994-09-12, 1994-09-13, 1994-09-15, 1994-09-16, 1994-09-19, 1994-09-20, 1994-09-21, 1994-09-22, 1994-09-23, 1994-09-27, 1994-09-28, 1994-09-29, 1994-09-30, 1994-10-03, 1994-10-04, 1994-10-06, 1994-10-14, 1994-10-18, 1994-10-19, 1994-10-25, 1994-10-26, 1994-10-27, 1994-10-28, 1994-10-31, 1994-11-01, 1994-11-09, 1994-11-10, 1994-11-11, 1994-11-16, 1994-11-17, 1994-11-25, 1994-11-28, 1994-12-01, 1994-12-02, 1994-12-06, 1994-12-07, 1994-12-08, 1994-12-09, 1994-12-12, 1994-12-13, 1994-12-14, 1994-12-15, 1994-12-16, 1994-12-23, 1994-12-27, 1994-12-29, 1994-12-30, 1995-01-03, 1995-01-05, 1995-01-09, 1995-01-11, 1995-01-13, 1995-01-16, 1995-01-17, 1995-01-18, 1995-01-19, 1995-01-20, 1995-01-24, 1995-01-25, 1995-02-13, 1995-02-17, 1995-05-01, 1995-07-03, 1995-11-24, 1995-12-26, 1996-01-08, 1996-01-09, 1996-07-05, 1996-11-29, 1996-12-26, 1997-11-28, 1997-12-26, 1998-01-02, 1998-11-27, 1999-06-17, 1999-06-18, 1999-06-21, 1999-06-22, 1999-06-23, 1999-06-24, 1999-06-25, 1999-06-28, 1999-06-29, 1999-06-30, 1999-07-01, 1999-07-02, 1999-07-06, 1999-07-07, 1999-07-08, 1999-07-09, 1999-07-12, 1999-07-13, 1999-07-14, 1999-07-15, 1999-07-16, 1999-07-19, 1999-07-20, 1999-07-21, 1999-07-22, 1999-07-23, 1999-07-26, 1999-07-27, 1999-07-28, 
1999-07-29, 1999-07-30, 1999-08-02, 1999-08-03, 1999-08-04, 1999-08-05, 1999-08-06, 1999-08-09, 1999-08-10, 1999-08-11, 1999-08-12, 1999-08-13, 1999-08-16, 1999-08-17, 1999-08-18, 1999-08-19, 1999-08-20, 1999-08-23, 1999-08-24, 1999-08-25, 1999-08-26, 1999-08-27, 1999-08-30, 1999-08-31, 1999-09-01, 1999-09-02, 1999-09-03, 1999-09-07, 1999-09-08, 1999-09-09, 1999-09-10, 1999-09-13, 1999-09-14, 1999-09-15, 1999-09-16, 1999-09-17, 1999-09-20, 1999-09-21, 1999-09-22, 1999-09-23, 1999-09-24, 1999-09-27, 1999-09-28, 1999-09-29, 1999-09-30, 1999-10-01, 1999-10-04, 1999-10-05, 1999-10-06, 1999-10-07, 1999-10-08, 1999-10-11, 1999-10-12, 1999-10-13, 1999-10-14, 1999-10-15, 1999-10-18, 1999-10-19, 1999-10-20, 1999-10-21, 1999-10-22, 1999-10-25, 1999-10-26, 1999-10-27, 1999-10-28, 1999-10-29, 1999-11-01, 1999-11-02, 1999-11-03, 1999-11-04, 1999-11-05, 1999-11-08, 1999-11-09, 1999-11-10, 1999-11-11, 1999-11-12, 1999-11-15, 1999-11-16, 1999-11-17, 1999-11-18, 1999-11-19, 1999-11-22, 1999-11-23, 1999-11-24, 1999-11-26, 1999-11-29, 1999-11-30, 1999-12-01, 1999-12-02, 1999-12-03, 1999-12-06, 1999-12-07, 1999-12-08, 1999-12-09, 1999-12-10, 1999-12-13, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-13, 2001-09-14, 2001-11-23, 2001-12-24, 2001-12-26, 2001-12-31, 2002-07-05, 2002-11-29, 2002-12-26, 2003-02-18, 2003-11-28, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24, 2011-01-03
CD  14  1995-08-09, 1995-12-26, 1996-01-02, 1996-06-11, 1996-06-20, 1996-09-09, 1996-09-11, 1996-12-26, 1997-01-02, 1997-12-26, 1998-01-02, 1998-04-16, 2001-01-02, 2001-09-12
CT  154 1995-11-24, 1996-01-08, 1996-07-05, 1996-11-29, 1996-12-24, 1997-11-28, 1997-12-26, 1998-11-27, 1999-11-26, 1999-12-31, 2000-07-03, 2000-11-24, 2001-09-11, 2001-09-12, 2001-09-13, 2001-09-14, 2001-11-12, 2001-11-23, 2001-12-24, 2001-12-31, 2002-05-21, 2002-05-22, 2002-05-23, 2002-05-24, 2002-05-28, 2002-05-29, 2002-05-30, 2002-05-31, 2002-06-03, 2002-06-04, 2002-06-05, 2002-06-06, 2002-06-07, 2002-06-10, 2002-06-11, 2002-06-12, 2002-06-13, 2002-06-14, 2002-06-17, 2002-06-18, 2002-06-19, 2002-06-20, 2002-06-21, 2002-06-24, 2002-06-25, 2002-06-26, 2002-06-27, 2002-06-28, 2002-07-01, 2002-07-02, 2002-07-03, 2002-07-05, 2002-07-08, 2002-07-09, 2002-07-10, 2002-07-11, 2002-07-12, 2002-07-15, 2002-07-16, 2002-07-17, 2002-07-18, 2002-07-19, 2002-07-22, 2002-07-23, 2002-07-24, 2002-07-25, 2002-07-26, 2002-07-29, 2002-07-30, 2002-07-31, 2002-08-01, 2002-08-02, 2002-08-05, 2002-08-06, 2002-08-07, 2002-08-08, 2002-08-09, 2002-08-12, 2002-08-13, 2002-08-14, 2002-08-15, 2002-08-16, 2002-08-19, 2002-08-20, 2002-08-21, 2002-08-22, 2002-08-23, 2002-08-26, 2002-08-27, 2002-08-28, 2002-08-29, 2002-08-30, 2002-09-03, 2002-09-04, 2002-09-05, 2002-09-06, 2002-09-09, 2002-09-10, 2002-09-11, 2002-09-12, 2002-09-13, 2002-09-16, 2002-09-17, 2002-09-18, 2002-09-19, 2002-09-20, 2002-09-23, 2002-09-24, 2002-09-25, 2002-09-26, 2002-09-27, 2002-09-30, 2002-10-01, 2002-10-02, 2002-10-03, 2002-10-04, 2002-10-07, 2002-10-08, 2002-10-09, 2002-10-10, 2002-10-11, 2002-10-14, 2002-10-15, 2002-10-16, 2002-10-17, 2002-10-18, 2002-10-21, 2002-10-22, 2002-10-23, 2002-10-24, 2002-10-25, 2002-10-28, 2002-10-29, 2002-10-30, 2002-10-31, 2002-11-01, 2002-11-04, 2002-11-05, 2002-11-06, 2002-11-07, 2002-11-29, 2002-12-24, 2003-02-18, 2003-11-28, 2003-12-26, 2004-01-02, 2004-06-11, 2004-11-26, 2004-12-31, 2005-11-25, 2006-11-24, 2007-01-02, 2007-11-23, 2007-12-24

So the function would have an argument passed where it would return the n most frequent occurrences of the dates above from the data.frame.

I looked at which.max but haven't been able to figure out how to apply it to multiple rows (entire data frame column), or to give me more than a single date (n) as output.

If the code would be a great deal simpler with only a single output value, that is acceptable as a starting point for me to work from. Any pointers are appreciated.

Here is a pastebin as I'm having trouble due to the length of the strings: http://pastebin.com/B1YPicC8

> str(gaps)
    'data.frame':   5560 obs. of  3 variables:
     $ Symbol      : Factor w/ 5560 levels "@AD#","@BP#",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ Count       : int  27 14 3 285 14 154 540 11 3 11 ...
     $ MissingDates: Factor w/ 3568 levels "1995-12-26, 1996-01-02, 1996-04-26, 1996-04-30, 1996-05-06, 1996-08-26, 1996-09-03, 1996-09-04, 1996-10-11, 1996-11-13, 1996-11"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10 ...

Solution

It seems like you need something like:

Function

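freqfunc <- function(x, n){
  # Split every row on ", ", pool all dates, count each one,
  # sort the counts in ascending order and keep the last n (the largest)
  tail(sort(table(unlist(strsplit(as.character(x), ", ")))), n)
}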
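
To see what each nested call contributes, here is a minimal sketch that unrolls the same pipeline step by step. The `toy` factor is invented for illustration and merely stands in for `gaps$MissingDates`; it is not taken from the pastebin data.

```r
# Hypothetical toy input (made up for this example)
toy <- factor(c("1999-12-31, 2001-12-24",
                "1999-12-31, 2002-12-24",
                "1999-12-31, 2001-12-24, 2003-08-06"))

# strsplit() only accepts character vectors, hence as.character() on the factor
pieces <- strsplit(as.character(toy), ", ")  # list: one character vector per row
dates  <- unlist(pieces)                     # single flat vector of all dates
counts <- table(dates)                       # frequency of each distinct date
sorted <- sort(counts)                       # ascending by frequency
top2   <- tail(sorted, 2)                    # last n entries = most frequent
print(top2)
```

Because the counts are sorted in ascending order, the most frequent date is the last element of the result, not the first.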

Testing on your data set

freqfunc(gaps$MissingDates, 5) # Five most frequent dates

## 1996-12-26 1997-12-26 1998-01-02 1999-12-31 2001-09-12 
##          4          4          4          4          4 
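
The five dates shown all share the same count, so which tied dates make the cut depends on the sort order. If a descending result in data frame form is more convenient, one possible variant is sketched below. The function name `top_dates` and the column names `date` and `n` are assumptions made for this example and are not part of the original answer; the sketch also trims whitespace so that inconsistent spacing after commas does not split one date into two categories.

```r
# Hypothetical variant of freqfunc(); names are illustrative only
top_dates <- function(x, n) {
  # Split on bare commas, then strip whitespace around each piece
  dates <- trimws(unlist(strsplit(as.character(x), ",", fixed = TRUE)))
  dates <- dates[nzchar(dates)]                    # drop empty pieces, if any
  counts <- sort(table(dates), decreasing = TRUE)  # most frequent first
  counts <- head(counts, n)                        # keep the first n entries
  data.frame(date = names(counts), n = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Made-up input with inconsistent spacing after the comma
toy <- c("1999-12-31, 2001-12-24",
         "1999-12-31,2002-12-24",
         "1999-12-31, 2001-12-24")
res <- top_dates(toy, 2)
print(res)
```

For the single-value case mentioned in the question, `names(which.max(table(dates)))` returns the first date that has the highest count.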
