Counting (for each row) how many times an OR condition on several columns is satisfied

Question

My question is similar to this one, except a bit different. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition involves several columns with an OR condition, and my real data has many columns, so ideally, I'd like to reference the columns using a regular expression.

I have the following data:

colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
   colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1     No   Yes    No    No    No   Yes   Yes    No Yes Yes
2     No   Yes   Yes   Yes   Yes    No   Yes    No  No  No
3     No    No    No   Yes    No    No    No    No Yes  No
4    Yes    No   Yes   Yes   Yes   Yes   Yes   Yes  No Yes
5    Yes    No   Yes    No    No    No    No   Yes  No Yes
6    Yes    No    No    No   Yes   Yes    No    No  No  No
7    Yes    No    No    No   Yes   Yes   Yes    No Yes  No
8    Yes    No   Yes    No   Yes   Yes    No   Yes Yes  No
9     No   Yes    No    No    No   Yes   Yes    No  No  No
10   Yes   Yes    No    No   Yes    No   Yes    No Yes  No
11    No   Yes    No    No   Yes    No   Yes   Yes  No  No
12    No   Yes   Yes   Yes    No    No   Yes    No  No  No
13    No    No   Yes   Yes    No   Yes   Yes   Yes Yes  No
14   Yes   Yes    No    No    No    No   Yes    No  No Yes
15   Yes    No   Yes   Yes    No   Yes    No   Yes  No  No
16    No   Yes   Yes    No    No    No   Yes    No  No  No
17   Yes    No    No    No    No   Yes   Yes   Yes  No Yes
18   Yes    No   Yes   Yes    No    No    No    No  No Yes
19    No    No    No    No    No   Yes    No    No  No Yes
20    No   Yes    No    No   Yes   Yes   Yes    No  No  No

I would like to create a new column Nb that records, for each row, how many of the column groups contain at least one "Yes": 1 if at least one of colA2, colA3, colA4 is == "Yes", plus 1 if at least one of colB2, colB3, colB4 is == "Yes".

If this "OR" condition were not implied when looking at a group of columns [colA2, colA3, colA4], and I were simply adding up the number of columns satisfying the condition, I could have used something like:

df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")

I would like to use regex if possible to reference the columns, as in my real data letters and numbers go further than B and 5 respectively.

Thanks!

Recommended answer

You could adapt your rowSums approach to just the groups of columns in each of your OR conditions, then add > 0 to make it "at least one." Thus, "at least one of the A values is Yes" would look like:

rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0
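To see what the `> 0` step does, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the asker's data). It contrasts the per-row count of "Yes" values with the "at least one" flag derived from it:

```r
# Hypothetical three-row example; values chosen by hand for illustration
toy <- data.frame(
  colA2 = c("Yes", "No", "Yes"),
  colA3 = c("Yes", "No", "No"),
  colA4 = c("No",  "No", "No")
)

# How many of the three columns are "Yes" in each row
n_yes <- rowSums(toy == "Yes")

# TRUE when at least one of them is "Yes" (the OR condition)
any_yes <- n_yes > 0
```

Here `n_yes` is 2, 0, 1 and `any_yes` is TRUE, FALSE, TRUE: the first row counts once, not twice.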

Then you can combine them using +:

(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
 (rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)
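If there are many letters, writing one `rowSums` term per letter gets tedious. One possible way to generalize, sketched below in base R, is to pull the letter out of each column name with a regular expression and loop over the resulting groups. The data frame, the regular expressions, and the names `target`, `letter_of` and `hits` are assumptions for illustration; in particular the pattern assumes names of the form "col" + one capital letter + a number, and keeps only numbers from 2 upward.

```r
# Hypothetical data, small and fixed so the result is reproducible
df <- data.frame(
  colA1 = c("Yes", "No",  "No"),
  colA2 = c("No",  "Yes", "No"),
  colA3 = c("No",  "Yes", "No"),
  colB1 = c("Yes", "Yes", "Yes"),
  colB2 = c("Yes", "No",  "No"),
  colB3 = c("No",  "Yes", "No"),
  meh   = c("Yes", "No",  "Yes"),
  stringsAsFactors = FALSE
)

# Columns of interest: "col", one capital letter, then a number >= 2
target <- grep("^col[A-Z]([2-9]|[1-9][0-9]+)$", names(df), value = TRUE)

# The letter each of those columns belongs to
letter_of <- sub("^col([A-Z])[0-9]+$", "\\1", target)

# One logical vector per letter: does the row have at least one "Yes"
# among that letter's columns?
hits <- lapply(split(target, letter_of), function(cols) {
  rowSums(df[, cols, drop = FALSE] == "Yes") > 0
})

# Add the per-letter flags together to get the number of groups satisfied
df$Nb <- as.integer(Reduce(`+`, hits))
```

With this data `df$Nb` is 1, 2, 0. The `drop = FALSE` keeps a data frame even when a letter has only one matching column, so `rowSums` still works.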






Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is the reason operations like "using columns with the value 'A'" are very awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:

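library(dplyr)
library(tidyr)
# Give each original row an identifier so it can be tracked after reshaping
df$index <- 1:nrow(df)
# Stack colA1..colB4 into key/value pairs, then split each key (e.g. "colA1")
# into its prefix, letter, and number
newdf <- df %>% gather(key, value, colA1:colB4) %>%
    separate(key, c("col", "letter", "number"), c(-3, -2)) %>%
    mutate(number = as.numeric(number))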

This rearranges your data as (note that I gave each of your rows its own "index" variable):

  meh muh index col letter number value
1 Yes  No     1 col      A      1   Yes
2 Yes  No     2 col      A      1   Yes
3  No  No     3 col      A      1   Yes
4 Yes  No     4 col      A      1    No
5 Yes Yes     5 col      A      1    No
6 Yes Yes     6 col      A      1   Yes

You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1).

Here's how you would perform the kind of OR operation you're describing:

hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
              summarize(hasyes = any(value == "Yes"))

For each of your original row+letter combinations, you now have a logical value for whether Yes appears:

  index letter hasyes
1     1      A   TRUE
2     1      B   TRUE
3     2      A   TRUE
4     2      B   TRUE
5     3      A  FALSE
6     3      B   TRUE

One more summarizing operation gets this into the form you want:

result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))
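Since the question asked for a new column Nb on the original data, the per-index counts still need to be attached back to it. A minimal sketch of one way to do that in base R follows; the small `df` and `result` here are hand-built stand-ins (hypothetical values) with the same `index` and `yeses` columns as above, and `match()` is used so the row order of `result` does not matter:

```r
# Stand-in for the original data with its index column (hypothetical values)
df <- data.frame(index = 1:3, meh = c("Yes", "No", "Yes"))

# Stand-in for the summarized result, deliberately in a different row order
result <- data.frame(index = c(3L, 1L, 2L), yeses = c(0L, 1L, 2L))

# Look up each row's count by its index and store it as Nb
df$Nb <- result$yeses[match(df$index, result$index)]
```

This gives `df$Nb` equal to 1, 2, 0. A join on `index` would do the same job if you prefer to stay within dplyr.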

What's important about this solution is that it will work for any number of letters you have (that is, if it goes from A-Z instead of just A and B) equally easily.
