Computing averages of chunks of a column


Question

I have a tab-delimited file:

LOC105758527    1       55001   0.469590
LOC105758527    1       65001   0.067909
LOC105758527    1       75001   0.220712
LOC100218126    1       85001   0.174872
LOC105758529    1       125001  0.023420
NRF1    1       155001  0.242222
NRF1    1       165001  0.202569
NRF1    1       175001  0.327963
UBE2H   1       215001  0.063989
UBE2H   1       225001  0.542340
KLHDC10 1       255001  0.293471
KLHDC10 1       265001  0.231621
KLHDC10 1       275001  0.142917
TMEM209 1       295001  0.273941
CPA2    1       315001  0.181312

I need to calculate the average of col 4 for each distinct value in col 1, that is, the sum divided by the line count. For each group, the output should contain col 1, 2 and 3 from the group's first line, with the average as col 4.

I started by computing the sum:

awk 'BEGIN { FS = OFS = "\t" }
        { y[$1] += $4; $4 = y[$1]; x[$1] = $0; }
END { for (i in x) { print x[i]; } }' file

But I get:

NRF1    1       175001  0.772754
LOC105758529    1       125001  0.02342
LOC100218126    1       85001   0.174872
KLHDC10 1       275001  0.668009
CPA2    1       315001  0.181312
TMEM209 1       295001  0.273941
UBE2H   1       225001  0.606329
LOC105758527    1       75001   0.758211

So the output is out of order: it starts from some line other than the first one in my file. It also prints col 1, 2 and 3 from the last line of each group, which is fine, but I would prefer the first line instead.

I also don't know how to divide each sum by its line count to actually get the average.

Recommended answer

This can be done in awk alone, using arrays to store the line ordering and the intermediate computation steps:

# set fields delimiters
BEGIN { FS = OFS = "\t" }

# print the header
NR==1 { print; next }

# the first time col1 value occurs, store col1..col3
!h[$1] {
    h[$1] = ++n  # save ordering
    d[n] = $1 OFS $2 OFS $3  # save first 3 columns
}

# store sum and quantity of col4
{
    i = h[$1]  # recover ordering
    s[i] += $4
    q[i]++
}

# output col1..col3 and the average value
END {
    for (i=1; i<=n; i++) print d[i], s[i]/q[i]
}


Since writing the above, I see that you have edited the question. If your data has no header, the NR==1 rule is not needed.
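As a rough usage sketch (not part of the original answer), the script can be run as a one-liner with the header rule dropped. The temporary file and the `result` variable are illustrative names, and the rows are a subset of the sample data from the question:

```shell
# Write a few sample rows from the question to a temporary file (tab-separated).
data=$(mktemp)   # illustrative temporary file
printf '%s\t%s\t%s\t%s\n' \
    LOC105758527 1 55001 0.469590 \
    LOC105758527 1 65001 0.067909 \
    LOC105758527 1 75001 0.220712 \
    LOC100218126 1 85001 0.174872 \
    NRF1 1 155001 0.242222 \
    NRF1 1 165001 0.202569 \
    NRF1 1 175001 0.327963 > "$data"

# Same logic as the script above, minus the NR==1 header rule.
result=$(awk 'BEGIN { FS = OFS = "\t" }
!h[$1] { h[$1] = ++n; d[n] = $1 OFS $2 OFS $3 }   # first occurrence: remember order and col1..col3
{ i = h[$1]; s[i] += $4; q[i]++ }                  # accumulate sum and count of col4
END { for (i = 1; i <= n; i++) print d[i], s[i]/q[i] }' "$data")

printf '%s\n' "$result"
rm -f "$data"
```

For these seven rows the output has one line per col 1 value, in first-seen order, with averages 0.252737, 0.174872 and 0.257585. The averages are printed with awk's default number format; a printf in the END block would give explicit control over the number of decimals.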

If your data file is really big, the script above may consume too much memory (it uses memory proportional to the number of unique values in col 1). If that is a problem and the order of the output lines is not important, memory usage can be reduced drastically by pre-sorting the data (perhaps with sort -k1,1 -s) and producing output incrementally:

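BEGIN { FS = OFS = "\t" }

# a new col1 value starts a new group
NR == 1 || $1 != c1 {
    if (NR > 1) print d, s/q  # flush the previous group
    d = $1 OFS $2 OFS $3      # save col1..col3 of the group's first line
    s = q = 0
    c1 = $1
}

# accumulate sum and quantity of col4
{
    s += $4
    q++
}

# flush the last group (skipped if the input was empty)
END { if (q) print d, s/q }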
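The pre-sort and the streaming script can be combined in one pipeline. The following is a minimal sketch under a few assumptions: the rows are a deliberately interleaved subset of the question's data, the temporary file and `result` variable are illustrative names, and `LC_ALL=C` is added here only to make the sort order reproducible. The `-s` (stable) option is supported by GNU and BSD sort; it keeps each group's lines in their original relative order, so col 1, 2 and 3 still come from the group's first line in the file.

```shell
# Interleaved sample rows (tab-separated), so that sorting actually matters.
data=$(mktemp)   # illustrative temporary file
printf '%s\t%s\t%s\t%s\n' \
    NRF1 1 155001 0.242222 \
    LOC105758527 1 55001 0.469590 \
    NRF1 1 165001 0.202569 \
    LOC105758527 1 65001 0.067909 \
    NRF1 1 175001 0.327963 \
    LOC105758527 1 75001 0.220712 > "$data"

# Group equal col1 values together, then average each group as it streams by.
result=$(LC_ALL=C sort -k1,1 -s "$data" | awk 'BEGIN { FS = OFS = "\t" }
NR == 1 || $1 != c1 {
    if (NR > 1) print d, s/q    # flush the previous group
    d = $1 OFS $2 OFS $3        # col1..col3 of the first line of the new group
    s = q = 0
    c1 = $1
}
{ s += $4; q++ }                # accumulate sum and count of col4
END { if (q) print d, s/q }')   # flush the last group

printf '%s\n' "$result"
rm -f "$data"
```

Only the running sum and count of the current group are held in awk at any time; the cost moves to sort, which can spill to disk for large inputs. The output follows the sorted order of col 1 rather than the order of first appearance in the file.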
