Subsetting Data Frame Based on Contents of a "Column" List


Problem description


I have a list matrix, where one of the "columns" is a list (I realize it's an odd dataset to work with, but I find it useful for other operations). Each entry of the list is either (1) empty (integer(0)), (2) a single integer, or (3) a vector of integers.


For example, take the R object d.f, with d.f$ID as an index vector and d.f$Basket_List as the list.

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List




My Question

Issue 1

I'd like to create a new dataset that is a subset of the initial one, based on whether or not Basket_List contains certain value(s). For example, a subset of all the rows in d.f such that Basket_List contains 123, or both 123 and 987, or meets other more complicated conditions.


I've tried every variation of the following, but to no avail.

d.f2 <- subset(d.f, 123 %in% Basket_List)
d.f2 <- subset(d.f, 123 == any(Basket_List))
d.f2 <- d.f[which(123 %in% d.f$Basket_List), ]
# should return the subset, with rows 2,3,5,7 & 8




Issue 2

My other issue is that I'll be running this operation over many millions of rows (it's transaction data), so I'd like to optimize it as much as possible for speed (I have a complicated for loop now, but it takes too much time).


If you think it might be useful, the data could also be set up as follows:

ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.d.f <- data.frame(ID,Basket)


Recommended answer

You can use sapply:

ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
                    c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)

sel <- sapply( Basket_List, function(bl,searchItem) {
  any(searchItem %in% bl)
}, searchItem=c(123) )

> sel
[1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE

> d.f[sel,,drop=FALSE]
  ID
2  2
3  3
5  5
7  7
8  8
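
The question also asks about baskets that contain both 123 and 987. The following is a minimal sketch under the same data as above, not part of the original answer; the helper name has_all and the variable sel_both are hypothetical. Replacing any() with all() turns the "at least one item" test into an "every item" test:

```r
ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
Basket_List <- list(integer(0), c(123, 987), c(123, 123), 456,
                    c(456, 123), 456, c(123, 987), c(987, 123), 987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List

# has_all() is a hypothetical helper: TRUE when every search item is in the basket
has_all <- function(bl, searchItems) all(searchItems %in% bl)

# vapply() works like sapply() but guarantees a logical vector as the result
sel_both <- vapply(d.f$Basket_List, has_all, logical(1),
                   searchItems = c(123, 987))

# Keep only the rows whose basket holds both 123 and 987 (IDs 2, 7 and 8)
d.f2 <- d.f[sel_both, , drop = FALSE]
d.f2$ID
```

More complicated conditions can be expressed the same way, by changing the body of the helper function and leaving the selection step untouched.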


Please be careful with your terminology. A data.frame is not a matrix. It's a type of list.
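
A quick way to check this in an R session, shown here on a small stand-in data frame rather than the full example:

```r
d.f <- data.frame(ID = c(1, 2, 3))

is.list(d.f)    # TRUE: a data.frame is a list of equal-length columns
is.matrix(d.f)  # FALSE
```

This is also why a column can itself be a list, as Basket_List is in the question.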


Speed-wise, sapply is not the fastest way to build the logical vector, but the selection itself (d.f[sel, ]) will be very fast because it is vectorized. If you need more speed, it is time to look at data.table.
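
The answer gives no data.table code, and none is reproduced here. As a base R sketch only, the long layout from the question (alt.d.f) lets the per-basket function call be replaced by a vectorized %in% on the Basket column. The names ids_any, ids_both and alt.sub are hypothetical, and whether this is fast enough for many millions of rows is an assumption that would need to be timed on the real data:

```r
ID <- c(1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 8, 9)
Basket <- c(NA, 123, 987, 123, 123, 456, 456, 123, 456, 123, 987, 987, 123, 987)
alt.d.f <- data.frame(ID, Basket)

# IDs with at least one row whose Basket is 123 (NA never matches)
ids_any <- unique(alt.d.f$ID[alt.d.f$Basket %in% 123])

# IDs whose baskets contain both 123 and 987
ids_both <- intersect(alt.d.f$ID[alt.d.f$Basket %in% 123],
                      alt.d.f$ID[alt.d.f$Basket %in% 987])

# All rows that belong to the IDs matching the first condition
alt.sub <- alt.d.f[alt.d.f$ID %in% ids_any, ]
```

Here ids_any holds IDs 2, 3, 5, 7 and 8, matching the rows the question expects, and ids_both holds IDs 2, 7 and 8.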
