Counting number of instances of a condition per row in R
Problem description
I have a large file with the first column being IDs, and the remaining 1304 columns being genotypes like below.
rsID sample1 sample2 sample3...sample1304
abcd aa bb nc nc
efgh nc nc nc nc
ijkl aa ab aa nc
I would like to count the number of "nc" values per row and output the result of that to another column so that I get the following:
rsID sample1 sample2 sample3...sample1304 no_calls
abcd aa bb nc nc 2
efgh nc nc nc nc 4
ijkl aa ab aa nc 1
The table function counts frequencies per column, not per row, and if I transpose the data to use it with table, I would need the file to look like this:
abcd aa[sample1]
abcd bb[sample2]
abcd nc[sample3] ...
abcd nc[sample1304]
efgh nc[sample1]
efgh nc[sample2]
efgh nc[sample3] ...
efgh nc[sample1304]
With this format, I would get the following which is what I want:
ID nc aa ab bb
abcd 2 1 0 1
efgh 4 0 0 0
Does anybody have any idea of a simple way to get frequencies by row? I am trying this right now, but it is taking quite some time to run:
rsids$Number_of_no_calls <- apply(rsids, 1, function(x) sum(x=="NC"))
Recommended answer
You can use rowSums.
df$no_calls <- rowSums(df == "nc")
df
# rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd aa bb nc nc 2
#2 efgh nc nc nc nc 4
#3 ijkl aa ab aa nc 1
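For anyone who wants to reproduce the result above, here is a minimal self-contained sketch. The toy data frame is an assumption that mirrors the question's layout with only four sample columns. The comparison df == "nc" yields a logical matrix and rowSums counts the TRUE values in each row, which is generally faster than a row-wise apply call because the summing is vectorised.

```r
# Toy data mirroring the question's layout (assumed; only four sample columns).
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# df == "nc" is a logical matrix; rowSums() counts the TRUE values per row.
no_calls <- rowSums(df == "nc")

# Attach the counts as a new column.
df$no_calls <- no_calls
print(df)
```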
Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach to
df$no_calls <- rowSums(df[-1] == "nc")
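To illustrate why dropping the ID column can matter, the sketch below uses a hypothetical data frame (df2, not from the question) in which one rsID happens to be the string "nc" and one genotype is missing. It also shows the na.rm argument of rowSums, which may be useful if the real file contains NA values.

```r
# Hypothetical data: one ID is literally "nc", and one genotype is NA.
df2 <- data.frame(
  rsID    = c("nc", "efgh"),
  sample1 = c("aa", "nc"),
  sample2 = c("nc", NA),
  stringsAsFactors = FALSE
)

# ID column included: the "nc" ID in row 1 is counted as a no-call.
with_id <- rowSums(df2 == "nc")

# ID column dropped: only genotype columns are compared.
without_id <- rowSums(df2[-1] == "nc")

# A row containing NA sums to NA unless na.rm = TRUE is set.
no_na <- rowSums(df2[-1] == "nc", na.rm = TRUE)

print(with_id)
print(without_id)
print(no_na)
```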
Regarding the row names: they are not counted in rowSums, and you can run a simple test to demonstrate it:
rownames(df)[1] <- "nc" # name first row "nc"
rowSums(df == "nc") # compute the row sums
#nc  2  3
# 2  4  1   # still the same in first row
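The question also sketches a full per-genotype table (ID, nc, aa, ab, bb). One possible way to get it without reshaping the data is to repeat the rowSums comparison once per genotype. This is only a sketch: the toy data frame and the genotypes vector are assumptions, and the real file may contain other codes.

```r
# Toy data mirroring the question's layout (assumed; only four sample columns).
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# Genotype codes to count (assumed; extend to match the real data).
genotypes <- c("nc", "aa", "ab", "bb")

# One rowSums() call per genotype; sapply() binds the results into a matrix
# with one row per rsID and one column per genotype.
counts <- sapply(genotypes, function(g) rowSums(df[-1] == g))
rownames(counts) <- df$rsID
print(counts)
```

Each genotype needs one pass over the data, so the cost grows with the number of distinct codes rather than with the number of rows handled in an R-level loop.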