Datamash 1.7 outputs zero on floating point values binning
I am using Datamash 1.7 on a CentOS 7.7 Linux x86_64 machine to sort and bin 24 GB of data. The input looks as follows (only the first 50 lines are shown):
Ind_poob
0.040618
0.006233
0.004652
0.003559
0.001752
0.001605
0.007701
0.004722
0.029899
0.00104
0.014031
6.1e-5
0.002144
0.002385
0.001145
0
0.001463
0
0.003414
0
0.001602
9.75e-4
0.007218
6.4e-5
0.006426
0
7.2e-5
1.13e-4
1.5e-4
0
4.19e-4
0.009325
7e-5
0.006592
0.01
0
0.001605
0.001924
0.003714
0.00335
0.001876
5.52e-4
0
0.019234
0.001415
1e-5
0
0.004304
2.15e-4
Desired Output (after scaling up)
#number bin_number
4061.8 4060.00
623.3 620.00
465.2 460.00
355.9 350.00
175.2 170.00
160.5 160.00
770.1 770.00
472.2 470.00
2989.9 2980.00
104 100.00
1403.1 1400.00
6.1 0.00
214.4 210.00
238.5 230.00
114.5 110.00
0 0.00
146.3 140.00
0 0.00
341.4 340.00
0 0.00
160.2 160.00
97.5 90.00
721.8 720.00
6.4 0.00
642.6 640.00
0 0.00
7.2 0.00
11.3 10.00
15 10.00
0 0.00
41.9 40.00
932.5 930.00
7 0.00
659.2 650.00
1000 1000.00
0 0.00
160.5 160.00
192.4 190.00
371.4 370.00
335 330.00
187.6 180.00
55.2 50.00
0 0.00
1923.4 1920.00
141.5 140.00
1 0.00
0 0.00
430.4 430.00
21.5 20.00
But with the Datamash command
$ datamash -H --format=%.8f -s bin 1 < test_data.txt
I am getting:
bin(ind_poob)
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
How can I write the datamash command so that it sorts and bins the input data with the correct floating point format? Secondly, given that the original input is 24 GB, will it be possible to plot the data with Gnuplot after binning?
Looking at the source (unfortunately, binning isn't described very well in the documentation), numeric binning is done by this code:
const long double val = num_value / op->params.bin_bucket_size;
modfl (val, & op->value);
/* signbit will take care of negative-zero as well. */
if (signbit (op->value))
  --op->value;
op->value *= op->params.bin_bucket_size;
Basically, it divides the number by the bucket size (the default is 100), keeps the integer part of the result (stepping down one bucket for negative values), and multiplies that by the bucket size. Since all the numbers in your sample data are in the range [0,1), every one of them lands in the same 0 bucket.
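The arithmetic can be checked outside datamash. The following is a minimal sketch that mimics the quoted logic in awk; it is not datamash's own code. The function name bin_value is made up for illustration, and it tests x < 0 instead of calling signbit, so it differs from the original for a negative-zero input.

```shell
#!/bin/sh
# bin_value VALUE [BUCKET_SIZE]
# Hypothetical helper (not part of datamash) that mimics the quoted logic:
# truncate VALUE / BUCKET_SIZE toward zero, step down one bucket for
# negative values, then multiply back by BUCKET_SIZE.
bin_value() {
    awk -v x="$1" -v size="${2:-100}" 'BEGIN {
        q = int(x / size)        # integer part, like modfl()
        if (x < 0) q = q - 1     # stands in for the signbit() check
        printf "%.2f\n", q * size
    }'
}

bin_value 0.040618        # default bucket size 100   -> 0.00
bin_value 4061.8 10       # scaled by 1e5, bucket 10  -> 4060.00
bin_value -5              # negative value            -> -100.00
```

The first call shows why every sample value ends up in bucket 0 with the default size, and the second shows the effect of scaling combined with a smaller bucket.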
You might try scaling your data by multiplying it by 1e4 (or more) to see if that gives you better numbers. Also, there is no need to sort the data; you can leave off the -s option.
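As a sketch of the scaling suggestion, the helper below multiplies the first column by a factor with awk and passes the header line through. Its output can then be piped to datamash with an explicit bucket size such as bin:10. The name scale_column is hypothetical, the factor 1e5 and the bucket size 10 are assumptions read off the desired output in the question, and test_data.txt is the file named there.

```shell
#!/bin/sh
# scale_column FACTOR: multiply column 1 of stdin by FACTOR.
# The first line is treated as a header and passed through unchanged.
# Hypothetical helper for illustration; not part of datamash.
scale_column() {
    awk -v f="$1" 'NR == 1 { print; next } { printf "%.10g\n", $1 * f }'
}

# Small self-check on three values from the question.
printf 'Ind_poob\n0.040618\n6.1e-5\n0\n' | scale_column 1e5

# With the real file, assuming datamash is installed:
#   scale_column 1e5 < test_data.txt | datamash -H --full bin:10 1
```

Scaling in a streaming awk step keeps memory use flat, which matters for a 24 GB input.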
Another approach is to treat the values as strings, not numbers, and use strbin, which uses a different algorithm that might work better for you:
$ datamash -H --full strbin:100 1 < test_data.txt
Ind_poob strbin(Ind_poob)
0.040618 60
0.006233 27
0.004652 70
0.003559 5
0.001752 30
0.001605 29
0.007701 37
0.004722 78
0.029899 25
0.00104 60
0.014031 17
6.1e-5 93
0.002144 84
0.002385 21
0.001145 57
...