Speedy test on R data frame to see if row values in one column are inside another column in the data frame


Question

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest:

  • variable
  • FO.variable

Here's a link to the dput output of a sample of the data frame: http://dpaste.com/2SJ6DPX

Please let me know if there's a better way of sharing this data.

All I want to do is create an additional binary Keep column, which should be:

  • 1 if FO.variable is inside variable
  • 0 if FO.variable is not inside variable

Seems like a simple thing... in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past few hours trying to get this working in R and failing.

Here's what I've tried:

  1. Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a single string. My early attempts failed because I tried to combine grepl with ifelse, which resulted in grepl using only the first value in the column instead of the whole column.

My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to work it out from there... The code ran for a while and then errored with an invalid subscript:

transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

My next attempt was to use str_detect, but I don't think this is the right approach because I want the row-level value, and I think 'any' will literally use any value in the vector?

kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))

I just tried a for loop. I would prefer a vectorized approach, but I'm pretty desperate at this point. I haven't used for loops before, as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right; I'm not sure if I screwed up the syntax:

for(i in 1:nrow(dd)){if(dd[i,4] %in% dd[i,2])dd$test[i] <- 1}

As I mentioned, my ideal output is an additional column holding 1 if FO.variable is inside variable and 0 otherwise. For example, the first three records in the sample data would be 1 and the fourth record would be 0, since "Direct/Unknown" is not within "Organic Search, System Email".

A bonus would be if a solution could run fast. The apply options were taking a long, long time, perhaps because they were looping over every combination of values across both columns?

This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how best to approach this.

Recommended answer

I read the data in

df = dget("http://dpaste.com/2SJ6DPX.txt")

then split the 'variable' column into its parts and worked out the length of each entry

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Then I unlisted v and created an index that maps each element of the unlisted v to the row from which it came

uv = unlist(v)
idx = rep(seq_along(v), len)

Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
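The steps so far can be traced on a small example. The following is only a sketch: the data frame `toy` and its values are made up for illustration and are not the asker's data, and the columns are assumed to be character rather than factor.

```r
# Hypothetical toy data (not the asker's data), columns stored as character.
toy <- data.frame(
    variable    = c("a,b", "c", "d,e,f"),
    FO.variable = c("b", "x", "d"),
    stringsAsFactors = FALSE
)

v   <- strsplit(toy$variable, ",", fixed = TRUE)  # list with one vector per row
len <- lengths(v)                                 # 2 1 3
uv  <- unlist(v)                                  # "a" "b" "c" "d" "e" "f"
idx <- rep(seq_along(v), len)                     # 1 1 2 3 3 3

# Compare every part with the FO.variable value of the row it came from.
test <- uv == toy$FO.variable[idx]                # FALSE TRUE FALSE TRUE FALSE FALSE

# Rows with at least one matching part are flagged.
toy$Keep <- FALSE
toy$Keep[idx[test]] <- TRUE
toy$Keep                                          # TRUE FALSE TRUE
```

Because `idx` repeats each row number once per part, a single vectorized `==` replaces the per-row loop.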

Or combined into one function (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))

f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)

    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}
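One thing to watch, sketched below under assumptions: f0 compares each part exactly, so if the entries in 'variable' are separated by a comma followed by a space (as in the "Organic Search, System Email" example quoted in the question), the second part keeps its leading space and would not compare equal. The variant `f0_trim` and the data frame `toy` are hypothetical names introduced here for illustration; whether trimming is needed depends on how the real data is stored. The last line also shows one way to get the 1/0 column the question asks for.

```r
# f0 as defined in the answer.
f0 <- function(dd) {
    v <- strsplit(as.character(dd$variable), ",", fixed = TRUE)
    len <- lengths(v)
    uv <- unlist(v)
    idx <- rep(seq_along(v), len)
    keep <- logical(nrow(dd))
    keep[idx[uv == as.character(dd$FO.variable)[idx]]] <- TRUE
    keep
}

# Hypothetical variant: same logic, but surrounding whitespace is trimmed
# from every part before the comparison (trimws() needs R >= 3.2.0).
f0_trim <- function(dd) {
    v <- strsplit(as.character(dd$variable), ",", fixed = TRUE)
    idx <- rep(seq_along(v), lengths(v))
    uv <- trimws(unlist(v))
    keep <- logical(nrow(dd))
    keep[idx[uv == as.character(dd$FO.variable)[idx]]] <- TRUE
    keep
}

# Hypothetical toy data using ", " as the separator.
toy <- data.frame(
    variable    = c("Organic Search, System Email", "Organic Search, System Email"),
    FO.variable = c("System Email", "Direct/Unknown"),
    stringsAsFactors = FALSE
)

plain   <- f0(toy)       # FALSE FALSE: " System Email" != "System Email"
trimmed <- f0_trim(toy)  # TRUE FALSE

# Convert the logical vector to the 1/0 column the question asks for.
toy$Keep <- as.integer(trimmed)
```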

(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compare this with the admittedly simpler and easier to understand alternatives:

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100
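Note that the two families of functions test slightly different things: f0 checks whether FO.variable equals one of the comma-separated parts exactly, while the grepl-based versions check whether it occurs anywhere as a substring. They agree on the sample data per the identical() calls above, but need not agree in general. A minimal sketch of the difference, using a made-up data frame `toy` (an assumption, not the asker's data):

```r
# Hypothetical toy data (not the asker's data).
toy <- data.frame(
    variable    = c("Organic Search,System Email", "Paid Search,Email", "Email"),
    FO.variable = c("System Email", "Search", "Direct/Unknown"),
    stringsAsFactors = FALSE
)

# Row-wise substring test, as in f1a. unname() drops the names that mapply()
# takes from its first character argument.
substring_match <- unname(
    mapply(grepl, toy$FO.variable, toy$variable, fixed = TRUE)
)

# Row-wise exact test against the comma-separated parts, as in f0.
parts <- strsplit(toy$variable, ",", fixed = TRUE)
exact_match <- unname(
    mapply(function(x, p) x %in% p, toy$FO.variable, parts)
)

substring_match  # TRUE TRUE FALSE  ("Search" is inside "Paid Search")
exact_match      # TRUE FALSE FALSE ("Search" is not one of the parts)
```

Which behaviour is wanted depends on whether a partial channel name should count as a match.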

Two subtle but important additions made while developing the timings were to use fixed=TRUE in the grepl calls (so the pattern is matched literally rather than as a regular expression), and to coerce the factors to character.
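A short sketch of why both points matter for correctness as well as speed. The strings and the factor below are made up for illustration:

```r
# Without fixed = TRUE the pattern is a regular expression, so "." matches
# any character; with fixed = TRUE it only matches a literal ".".
regex_hit <- grepl("A.B", "AxB")                # TRUE
fixed_hit <- grepl("A.B", "AxB", fixed = TRUE)  # FALSE

# A factor stores integer codes plus labels. as.character() recovers the
# labels, which is what a string comparison needs.
f <- factor(c("b", "a", "b"))
codes  <- as.integer(f)    # 2 1 2
labels <- as.character(f)  # "b" "a" "b"
```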
