R: `split` preserving natural order of factors
Question
`split` will always order the splits lexicographically. There may be situations where one would rather preserve the natural order. One can always write a hand-rolled function, but is there a base R solution that does this?
Input:
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1 2013-04-01 INDUSINDBK SIEMENS 4 2013
2 2013-04-01 NMDC WIPRO 4 2013
3 2012-09-28 LUPIN SAIL 9 2012
4 2012-09-28 ULTRACEMCO STER 9 2012
5 2012-04-27 ASIANPAINT RCOM 4 2012
6 2012-04-27 BANKBARODA RPOWER 4 2012
Output of `split`:
R> split(nifty.dat, nifty.dat$yearmon)
$`4 2012`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
5 2012-04-27 ASIANPAINT RCOM 4 2012
6 2012-04-27 BANKBARODA RPOWER 4 2012
$`4 2013`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1 2013-04-01 INDUSINDBK SIEMENS 4 2013
2 2013-04-01 NMDC WIPRO 4 2013
$`9 2012`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
3 2012-09-28 LUPIN SAIL 9 2012
4 2012-09-28 ULTRACEMCO STER 9 2012
Note that `yearmon` is already sorted in the particular order I want. This can be taken as given, because the question is slightly mis-specified if it does not hold.
Desired output:
$`4 2013`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1 2013-04-01 INDUSINDBK SIEMENS 4 2013
2 2013-04-01 NMDC WIPRO 4 2013
$`9 2012`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
3 2012-09-28 LUPIN SAIL 9 2012
4 2012-09-28 ULTRACEMCO STER 9 2012
$`4 2012`
Date.of.Inclusion Securities.Included Securities.Excluded yearmon
5 2012-04-27 ASIANPAINT RCOM 4 2012
6 2012-04-27 BANKBARODA RPOWER 4 2012
Thanks.
PS: I know there are better ways to create `yearmon` so that it preserves that order, but I am looking for a generic solution.
Recommended answer
`split` converts the `f` (second) argument to a factor if it isn't already one. By default, `factor` sorts the levels, which is why the groups come out in lexicographic order. So, if you want the order to be retained, convert the column to a factor yourself with the desired levels. That is:
df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
# now split
split(df, df$yearmon)
# $`4_2013`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 1 2013-04-01 INDUSINDBK SIEMENS 4_2013
# 2 2013-04-01 NMDC WIPRO 4_2013
# $`9_2012`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 3 2012-09-28 LUPIN SAIL 9_2012
# 4 2012-09-28 ULTRACEMCO STER 9_2012
# $`4_2012`
# Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 5 2012-04-27 ASIANPAINT RCOM 4_2012
# 6 2012-04-27 BANKBARODA RPOWER 4_2012
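As a self-contained illustration of the same idea, here is a minimal sketch on a made-up toy data frame (the `id` column and the object names `default_split` and `ordered_split` are hypothetical, not from the question). It builds the factor on the fly, so the original column is left untouched:

```r
# Toy data (hypothetical); rows are already in the desired group order
df <- data.frame(
  id      = 1:6,
  yearmon = c("4 2013", "4 2013", "9 2012", "9 2012", "4 2012", "4 2012"),
  stringsAsFactors = FALSE
)

# Default behaviour: the character vector becomes a factor with sorted levels
default_split <- split(df, df$yearmon)

# Levels in order of first appearance; df$yearmon itself stays a character column
f <- factor(df$yearmon, levels = unique(df$yearmon))
ordered_split <- split(df, f)

names(default_split)  # "4 2012" "4 2013" "9 2012"
names(ordered_split)  # "4 2013" "9 2012" "4 2012"
```

Passing the factor as a separate object rather than overwriting the column is only a design choice; it avoids changing the type of `yearmon` for later code.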
But do not use `split`. Use `data.table` instead:
However, `split` tends to be terribly slow as the number of levels increases. So I'd suggest using `data.table` to subset into a list. I'd expect that to be much faster:
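require(data.table)
dt <- data.table(df)
# number the groups in order of first appearance
dt[, grp := .GRP, by = yearmon]
# sort by group number
setkey(dt, grp)
# collect each group's rows as one list element
o2 <- dt[, list(list(.SD)), by = grp]$V1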
Benchmarking on larger data:
set.seed(45)
dates <- seq(as.Date("1900-01-01"), as.Date("2013-12-31"), by = "days")
ym <- do.call(paste, c(expand.grid(1:500, 1900:2013), sep="_"))
df <- data.frame(x1 = sample(dates, 1e4, TRUE),
                 x2 = sample(letters, 1e4, TRUE),
                 x3 = sample(10, 1e4, TRUE),
                 yearmon = sample(ym, 1e4, TRUE),
                 stringsAsFactors=FALSE)
require(data.table)
dt <- data.table(df)
f1 <- function(dt) {
    dt[, grp := .GRP, by = yearmon]
    setkey(dt, grp)
    o1 <- dt[, list(list(.SD)), by=grp]$V1
}
f2 <- function(df) {
    df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
    o2 <- split(df, df$yearmon)
}
require(microbenchmark)
microbenchmark(o1 <- f1(dt), o2 <- f2(df), times = 10)
# Unit: milliseconds
#          expr        min         lq     median        uq      max neval
#  o1 <- f1(dt)   43.72995   43.85035   45.20087  715.1292 1071.976    10
#  o2 <- f2(df) 4485.34205 4916.13633 5210.88376 5763.1667 6912.741    10
Note that `o1` will be an unnamed list. You can set the names simply with `names(o1) <- unique(dt$yearmon)`.