从两个预先计算的直方图报告两个样本的K-S统计量 [英] Report a two-sample K-S statistic from two precomputed histograms
问题描述
问题:
在这里,我绘制了两个存储在文本文件中的数据集(在列表dataset
中),每个数据集包含218亿个数据点.这使得数据太大,无法作为数组保存在内存中.我仍然可以将它们绘制为直方图,但是我不确定如何通过
Here I plot 2 datasets stored in textfiles (in list dataset
) each containing 21.8 billion data points. This makes the data too large to hold in memory as an array. I am still able to graph them as histograms, but I'm unsure how to calculate their difference via a 2 sample KS test. This is because I cannot figure out how to access each histogram in the plt object.
示例:
下面是一些用于生成虚拟数据的代码:
Here is some code to generate dummy data:
mu = [100, 120]
sigma = 30
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for idx, file in enumerate(dataset):
dist = np.random.normal(mu[idx], sigma, 10000)
with open(file, 'w') as g:
for s in dist:
g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', str(s)))
This generates my two histograms (made possible here):
chunksize = 1000
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for fh in dataset:
# find the min, max, line qty, for bins
low = np.inf
high = -np.inf
loop = 0
for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
low = np.minimum(chunk.iloc[:, 2].min(), low)
high = np.maximum(chunk.iloc[:, 2].max(), high)
loop += 1
lines = loop*chunksize
nbins = math.ceil(math.sqrt(lines))
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64) # np.ndarray filled with np.uint32 zeros, CHANGED TO int64
for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
# compute bin counts over the 3rd column
subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges) # np.ndarray filled with np.int64
# accumulate bin counts over chunks
total += subtotal
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
plt.savefig('gsl_test_hist.svg')
问题:
大多数示例KS统计信息使用两个数组原始数据/观测/点/等,但是我没有足够的内存来使用这种方法.根据上面的示例,如何访问这些预先计算的bin(从'gsl_test_1.txt'
和'gsl_test_2.txt'
来计算两个分布之间的KS统计信息?
Most examples for KS-statistics employ two arrays of raw data/observations/points/etc, but I don't have enough memory to use this approach. Per the example above, how can I access these precomputed bins (from 'gsl_test_1.txt'
and 'gsl_test_2.txt'
to compute the KS statistic between the two distributions?
奖励业力: 在图表上记录KS统计信息和pvalue!
Bonus karma: Record the KS statistic and pvalue on the graph!
推荐答案
我稍微清理了一下代码.写入StringIO
,因此比写入文件更加精简.使用seaborn
而不是matplotlib
设置默认的氛围,使它看起来更现代.如果您希望统计测试排队,则两个样本的bins
阈值应相同.我认为,如果您进行迭代并以这种方式制作垃圾箱,则整个过程可能会花费比所需时间更长的时间. Counter
可能很有用,因为您只需循环一次即可,而且您还可以制作相同的料箱大小.将浮点数转换为整数,因为您将它们合并在一起. from collections import Counter
,然后是C = Counter()
和C[value] += 1
.您将在末尾有一个dict
,您可以在其中创建list(C.keys())
的垃圾箱.因为您的数据太陈旧了,所以这会很好.同样,您应该查看是否有一种方法可以用numpy
而不是pandas
b/c numpy
来进行chunksize
索引.为DF.iloc[i,j]
和ARRAY[i,j]
尝试%timeit
,您将明白我的意思.我写了很多它的功能来尝试使其更具模块化.
i cleaned up your code a bit. writing to StringIO
so it's more streamline than writing to a file. set the default vibe w/ seaborn
instead of matplotlib
to make it look more modern. the bins
thresholds should be the same for both samples if you want the stat test to line up. i think if you iterate through and make the bins this way the whole thing may take way longer than it needs to. Counter
could be useful b/c you'll only have to loop through once...plus you'll be able to make the same bin size. converting floats to ints since you are binning them together. from collections import Counter
then C = Counter()
and C[value] += 1
. you'll have a dict
at the end where you can make the bins from the list(C.keys())
. this would be good since your data is so gnarly. also, you should see if there is a way to do chunksize
with numpy
instead of pandas
b/c numpy
is WAY faster at indexing. try a %timeit
for DF.iloc[i,j]
and ARRAY[i,j]
and you'll see what i mean. i wrote much of it a function to try making it more modular.
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from io import StringIO
from scipy.stats import ks_2samp
import seaborn as sns; sns.set()
%matplotlib inline
#Added seaborn b/c it looks mo betta
mu = [100, 120]
sigma = 30
def write_random(file,mu,sigma=30):
dist = np.random.normal(mu, sigma, 10000)
for i,s in enumerate(dist):
file.write('{}\t{}\t{}\n'.format("label_A-%d" % i, "label_B-%d" % i, str(s)))
return(file)
#Writing to StringIO instead of an actual file
gs1_test_1 = write_random(StringIO(),mu=100)
gs1_test_2 = write_random(StringIO(),mu=120)
chunksize = 1000
def make_hist(fh,ax):
# find the min, max, line qty, for bins
low = np.inf
high = -np.inf
loop = 0
fh.seek(0)
for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
low = np.minimum(chunk.iloc[:, 2].min(), low) #btw, iloc is way slower than numpy array indexing
high = np.maximum(chunk.iloc[:, 2].max(), high) #you might wanna import and do the chunks with numpy
loop += 1
lines = loop*chunksize
nbins = math.ceil(math.sqrt(lines))
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64) # np.ndarray filled with np.uint32 zeros, CHANGED TO int64
fh.seek(0)
for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
# compute bin counts over the 3rd column
subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges) # np.ndarray filled with np.int64
# accumulate bin counts over chunks
total += subtotal
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total,axes=ax,alpha=0.5)
return(ax,bin_edges,total)
#Make the plot canvas to write on to give it to the function
fig,ax = plt.subplots()
test_1_data = make_hist(gs1_test_1,ax)
test_2_data = make_hist(gs1_test_2,ax)
#test_1_data[1] == test_2_data[1] The bins should be the same if you're going try and compare them...
ax.set_title("ks: %f, p_in_the_v: %f" % ks_2samp(test_1_data[2], test_2_data[2]))
这篇关于从两个预先计算的直方图报告两个样本的K-S统计量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!