Calculate mean of each column ignoring missing data with awk
Problem description
```
I have a large tab-separated data table with thousands of rows and dozens of columns, in which missing data are marked as "na". For example:
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
I would like to calculate the mean of each column while making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk, but I am not sure how to construct the command so that it covers all columns and accounts for missing data.
All I know how to do is calculate the mean of a single column, but that treats the missing data as 0 rather than leaving it out of the calculation:
awk '{sum+=$1} END {print sum/NR}' filename
This is obscure, but it works for your example:
awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
EDIT: Here is how it works:
awk '{for(i=1; i<=NF; i++){ # for each column of the current row
    sum[i] += $i;           # add the value to the "sum" array (a non-numeric "na" adds 0)
    if($i != "na"){         # if the value is not "na"
      count[i]+=1}          # increment the column "count" (end of if)
  }                         # end of for loop
}                           # end of the per-row block
END {                       # after the last row
  for(i=1; i<=NF; i++){     # for each column (NF still holds the field count of the last row)
    if(count[i]!=0){        # if the column count is not 0
      v = sum[i]/count[i]   # then calculate the column mean (stored in "v")
    }else{                  # else (the column count is 0)
      v = 0                 # then let the mean be 0 (note: you can set this to "na")
    };                      # end of if/else on the column count
    if(i<NF){               # if the column is before the last column
      printf "%f\t",v       # print mean + TAB
    }else{                  # else (it is the last column)
      print v}              # print mean + NEWLINE (end of if/else)
  };                        # end of for loop
}' input.txt                # end of END block (note: input.txt is the input file)
```
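The answer above still adds "na" cells to the sum (as 0) and reports a mean of 0 for a column with no data. A slightly more explicit variant is sketched below, not as the original author's method but as one possible rewrite. It assumes whitespace-separated input (awk's default field splitting, which also covers tabs) and a missing-value marker that is exactly `na`. The function name `col_means` and the variable `ncol` are hypothetical names chosen for this illustration.

```shell
# Hypothetical helper: print the mean of every column, skipping "na" cells.
col_means() {
  awk '
    {
      if (NF > ncol) ncol = NF            # remember the widest row seen
      for (i = 1; i <= NF; i++) {
        if ($i != "na") {                 # only non-missing values are summed
          sum[i] += $i
          count[i]++
        }
      }
    }
    END {
      for (i = 1; i <= ncol; i++) {
        if (count[i] > 0) {
          printf "%f", sum[i] / count[i]  # mean over the non-missing cells
        } else {
          printf "na"                     # the column had no data at all
        }
        printf "%s", (i < ncol ? "\t" : "\n")
      }
    }
  ' "$1"
}

# Sample data from the question (whitespace-separated here).
cat > input.txt <<'EOF'
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
EOF

col_means input.txt
```

For the sample data this prints one tab-separated line: `0.973333`, `0.982500`, `na`, `0.742500`, `0.010000`, `0.712500`. Tracking the widest row in `ncol` avoids relying on `NF` inside `END`, and every column goes through the same `%f` format, so the last column is no longer printed differently from the others.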