Counting (for each row) how many times an OR condition on several columns is satisfied
Question
My question is similar to this one, with one difference. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition involves several columns combined with an OR, and my real data has many columns, so ideally I'd like to reference the columns using a regular expression.
I have the following data:
colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1 No Yes No No No Yes Yes No Yes Yes
2 No Yes Yes Yes Yes No Yes No No No
3 No No No Yes No No No No Yes No
4 Yes No Yes Yes Yes Yes Yes Yes No Yes
5 Yes No Yes No No No No Yes No Yes
6 Yes No No No Yes Yes No No No No
7 Yes No No No Yes Yes Yes No Yes No
8 Yes No Yes No Yes Yes No Yes Yes No
9 No Yes No No No Yes Yes No No No
10 Yes Yes No No Yes No Yes No Yes No
11 No Yes No No Yes No Yes Yes No No
12 No Yes Yes Yes No No Yes No No No
13 No No Yes Yes No Yes Yes Yes Yes No
14 Yes Yes No No No No Yes No No Yes
15 Yes No Yes Yes No Yes No Yes No No
16 No Yes Yes No No No Yes No No No
17 Yes No No No No Yes Yes Yes No Yes
18 Yes No Yes Yes No No No No No Yes
19 No No No No No Yes No No No Yes
20 No Yes No No Yes Yes Yes No No No
I would like to create a new column Nb that records, for each row: 1 if at least one of colA2, colA3, colA4 is == "Yes", plus 1 if at least one of colB2, colB3, colB4 is == "Yes". With two letter groups, Nb therefore ranges from 0 to 2.
If there were no "OR" condition implied when looking at a group of columns [colA2, colA3, colA4], and I were simply adding up the number of columns satisfying the condition, I could have used something like:
df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")
I would like to use a regex if possible to reference the columns, since in my real data the letters and numbers go further than B and 5 respectively.
Thanks!
Recommended answer
You could adapt your rowSums approach to just the group of columns in each of your OR conditions, then add > 0 to turn the count into "at least one." Thus, "at least one of the A values is Yes" would look like:
rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0
Then you can combine the groups using +, which treats each TRUE as 1 and each FALSE as 0:
(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
(rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)
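Because the real data has more letters than A and B, the same idea can be applied to each letter group in a loop instead of writing one rowSums call per letter. The following is only a minimal sketch and is not part of the original answer: the data frame `toy` and the helper `count_groups_with_yes` are hypothetical names, and it assumes that column names follow the pattern col + one capital letter + a number, and that only numbers 2 and above should be considered.

```r
# Toy data (hypothetical), using the same naming scheme as the question
toy <- data.frame(
  colA1 = c("Yes", "No",  "No"),
  colA2 = c("No",  "Yes", "No"),
  colA3 = c("No",  "Yes", "No"),
  colB1 = c("Yes", "Yes", "Yes"),
  colB2 = c("No",  "No",  "Yes"),
  colB3 = c("No",  "Yes", "No"),
  meh   = c("Yes", "No",  "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each letter group, flag the rows where at least one
# selected column equals `value`, then count the flagged groups per row.
# The default pattern keeps numbers >= 2 (an assumption taken from the question).
count_groups_with_yes <- function(data,
                                  pattern = "^col([A-Z])([2-9]|[1-9][0-9]+)$",
                                  value = "Yes") {
  cols <- grep(pattern, names(data), value = TRUE)
  col_letters <- sub(pattern, "\\1", cols)
  flags <- sapply(unique(col_letters), function(l) {
    group <- cols[col_letters == l]
    # "at least one column in this group equals `value`"
    rowSums(data[, group, drop = FALSE] == value) > 0
  })
  # One column of flags per letter; summing across them counts the groups
  rowSums(matrix(flags, nrow = nrow(data)))
}

toy$Nb <- count_groups_with_yes(toy)
toy$Nb
```

In this toy example the first row has no "Yes" in either group, the second has one in both groups, and the third only in group B.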
Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is why operations like "use the columns with the value 'A'" are so awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:
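library(dplyr)
library(tidyr)
# Give each original row an identifier so it can be regrouped later
df$index <- 1:nrow(df)
# Reshape to long format, then split names such as "colA1" into "col", "A", "1".
# Positive split positions (after characters 3 and 4) are used here so the
# result does not depend on how a given tidyr version counts negative positions.
newdf <- df %>% gather(key, value, colA1:colB4) %>%
  separate(key, c("col", "letter", "number"), sep = c(3, 4)) %>%
  mutate(number = as.numeric(number))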
This rearranges your data as follows (note that I gave each of your rows its own "index" variable):
meh muh index col letter number value
1 Yes No 1 col A 1 Yes
2 Yes No 2 col A 1 Yes
3 No No 3 col A 1 Yes
4 Yes No 4 col A 1 No
5 Yes Yes 5 col A 1 No
6 Yes Yes 6 col A 1 Yes
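As an alternative sketch of the same reshaping step, newer tidyr releases offer pivot_longer, which can split the column names with a regular expression in one call. This is not what the answer used; it assumes a tidyr version that provides pivot_longer (1.0.0 or later), and `toy` and `toy_long` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), using the same naming scheme as the question
toy <- data.frame(
  colA1 = c("Yes", "No"),
  colA2 = c("No",  "Yes"),
  colB1 = c("No",  "No"),
  colB2 = c("Yes", "No"),
  meh   = c("Yes", "No"),
  stringsAsFactors = FALSE
)
toy$index <- seq_len(nrow(toy))

toy_long <- toy %>%
  pivot_longer(
    cols = matches("^col[A-Z][0-9]+$"),      # select columns by regex
    names_to = c("letter", "number"),        # one new column per capture group
    names_pattern = "^col([A-Z])([0-9]+)$",  # capture the letter and the number
    values_to = "value"
  ) %>%
  mutate(number = as.numeric(number))

toy_long
```

Selecting the columns with matches() means the call does not need to change when more letters or numbers are added, as long as the names keep the same pattern.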
You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1).
Here's how you would perform the kind of OR operation you're describing:
hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
summarize(hasyes = any(value == "Yes"))
For each of your original row + letter combinations, you now have a logical value for whether Yes appears:
index letter hasyes
1 1 A TRUE
2 1 B TRUE
3 2 A TRUE
4 2 B TRUE
5 3 A FALSE
6 3 B TRUE
One more summarizing operation gets this into the form you want:
result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))
What's important about this solution is that it works equally easily for any number of letters (for example, if they run from A to Z instead of just A and B).
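To get the per-row count back into a wide table as the Nb column requested in the question, the summary can be joined on the row identifier. The following is a minimal sketch under stated assumptions: `toy_long`, `toy_wide` and `counts` are hypothetical names, the long data is shaped like newdf above, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical long-format data, shaped like `newdf` in the answer
toy_long <- data.frame(
  index  = c(1, 1, 1, 1, 2, 2, 2, 2),
  letter = c("A", "A", "B", "B", "A", "A", "B", "B"),
  number = c(1, 2, 1, 2, 1, 2, 1, 2),
  value  = c("Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

counts <- toy_long %>%
  filter(number > 1) %>%                        # ignore columns numbered 1
  group_by(index, letter) %>%
  summarize(hasyes = any(value == "Yes"), .groups = "drop") %>%
  group_by(index) %>%
  summarize(Nb = sum(hasyes))                   # number of letter groups with a Yes

# Hypothetical wide table that shares the same row identifier
toy_wide <- data.frame(index = c(1, 2), meh = c("Yes", "No"))
toy_wide <- left_join(toy_wide, counts, by = "index")
toy_wide$Nb
```

Joining on index rather than relying on row order keeps the counts aligned even if the wide table has been sorted or filtered in the meantime.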