Speedy test on R data frame to see if row values in one column are inside another column in the data frame


Question

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.


  • variable

  • FO.variable

Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX

Please let me know if there's a better way of sharing this data.

All I want to do is create an additional binary keep column which should be:


  • 1 if FO.variable is inside variable

  • 0 if FO.variable is not inside variable

Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past few hours trying to get this in R and failing.

Here's what I've tried:


  1. Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.
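The behaviour described here can be reproduced on a couple of made-up strings. This is only a sketch: `pattern` and `text` are hypothetical stand-ins for the two columns. When `grepl` is given a pattern vector longer than one, it warns and uses only the first element, whereas `mapply` pairs the elements up row by row.

```r
# Hypothetical stand-ins for FO.variable (pattern) and variable (text)
pattern <- c("Email", "Paid")
text    <- c("System Email", "Paid Search")

# grepl() only uses pattern[1] ("Email") and warns about the rest
first_only <- suppressWarnings(grepl(pattern, text, fixed = TRUE))

# mapply() calls grepl(pattern[i], text[i], fixed = TRUE) for each i
row_wise <- unname(mapply(grepl, pattern, text, fixed = TRUE))

print(first_only)
print(row_wise)
```
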

My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there...the code ran for a while, then errored with an invalid subscript.

transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?

kk <- sapply(dd$variable, function(x) any(sapply(fo.variable, str_detect, string = x)))

Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:

for(i in 1:nrow(dd)){
  if(dd[i,4] %in% dd[i,2])
    dd$test[i]
}
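For comparison, a loop that does fill the column might look like the sketch below. It runs on a hypothetical three-row data frame, not the real data, and it assumes that "inside" means a literal substring match, which is why it uses `grepl(..., fixed = TRUE)` rather than `%in%` (`%in%` only tests whether the whole string is equal).

```r
# Hypothetical sample rows, not the asker's data
dd <- data.frame(
  variable    = c("Organic Search, System Email", "Paid Search",
                  "Organic Search, System Email"),
  FO.variable = c("System Email", "Paid Search", "Direct/Unknown"),
  stringsAsFactors = FALSE
)

dd$test <- 0L                          # start with 0 in every row
for (i in seq_len(nrow(dd))) {
  # literal substring test for row i only
  if (grepl(dd$FO.variable[i], dd$variable[i], fixed = TRUE)) {
    dd$test[i] <- 1L
  }
}
print(dd$test)
```
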

As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".

A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?

This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how best to approach this.

Recommended answer

I read the data in

df = dget("http://dpaste.com/2SJ6DPX.txt")

then split the 'variable' column into its parts and figured out the lengths of each entry

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Then I unlisted v and created an index that maps each element of the unlisted v to the row it came from

uv = unlist(v)
idx = rep(seq_along(v), len)
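
To make the bookkeeping concrete, here is a small sketch of these steps on a hypothetical three-row data frame (`toy` and its values are invented for illustration; they are not the data from the question).

```r
# Hypothetical data: row 1 and row 3 each hold two comma-separated parts
toy <- data.frame(
  variable    = c("Organic Search,System Email", "Paid Search",
                  "Organic Search,System Email"),
  FO.variable = c("System Email", "Paid Search", "Direct/Unknown"),
  stringsAsFactors = FALSE
)

v   <- strsplit(toy$variable, ",", fixed = TRUE)  # one character vector per row
len <- lengths(v)                                 # number of parts in each row
uv  <- unlist(v)                                  # all parts in one flat vector
idx <- rep(seq_along(v), len)                     # source row of each part of uv

print(data.frame(part = uv, row = idx))
```

Each element of `uv` can then be compared against `FO.variable[idx]`, so every part is only ever tested against the value from its own row.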

Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE

Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
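
The definition of the combined function is missing from this copy of the answer. The sketch below is a reconstruction from the steps above, not the author's original code. The name `f0` and the argument `dd` come from the text; the `trimws()` call is an addition, there so that entries separated by ", " (comma plus space) still match, and `toy` is hypothetical data.

```r
# Reconstruction (assumption): the steps above wrapped in one function that
# returns a logical vector instead of modifying the data frame.
f0 <- function(dd) {
  v    <- strsplit(as.character(dd$variable), ",", fixed = TRUE)
  uv   <- trimws(unlist(v))              # trimws() added here: tolerates ", " separators
  idx  <- rep(seq_along(v), lengths(v))  # source row of each element of uv
  test <- uv == as.character(dd$FO.variable)[idx]
  keep <- logical(nrow(dd))              # all FALSE to start with
  keep[idx[test]] <- TRUE                # rows with at least one matching part
  keep
}

# Hypothetical data, not the asker's
toy <- data.frame(
  variable    = c("Organic Search, System Email", "Paid Search",
                  "Organic Search, System Email"),
  FO.variable = c("System Email", "Paid Search", "Direct/Unknown")
)
toy$Keep <- f0(toy)
print(toy$Keep)
```

Note that this compares whole comma-separated entries, which is stricter than the substring match done by `grepl`; the two only agree when FO.variable always holds a complete entry.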

(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?)

Compared with (the admittedly simpler and easier to understand)

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))


with

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100

Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
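
As a rough illustration of why both points matter (the strings below are made up for the example): without `fixed=TRUE` the pattern is interpreted as a regular expression, so metacharacters such as "." change what matches, and a factor stores integer level codes rather than the strings themselves.

```r
# Regular expression vs. literal matching
regex_match   <- grepl("a.c", "abc")                # "." matches any character
literal_match <- grepl("a.c", "abc", fixed = TRUE)  # looks for the literal text "a.c"

# A factor holds integer codes that index into its (sorted) levels
f      <- factor(c("Paid Search", "Direct/Unknown"))
codes  <- as.integer(f)    # positions in levels(f), not the text
labels <- as.character(f)  # the text itself

print(c(regex_match, literal_match))
print(codes)
print(labels)
```

Literal matching also skips the regular-expression machinery, and converting to character once up front avoids handling factor elements one at a time inside `mapply`; both are consistent with `f1a` being faster than `f1` in the timings above.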
