Converting large dataframe to nd.array, doing spearman corr


Question

I have a large DataFrame with samples as the index and names as the header (500 x 30000), e.g.:

          Name1    Name2    Name3
Sample1   232.12   0.239    -0.324
Sample2   0.928    23.213   -0.056
Sample3   -0.231   7.7776   -0.984

What I want to get:

          Name1    Name2    Name3
Name1      1        0.001    corr val
Name2      corr val   1      corr val
Name3      corr val  corr val   1

and so on.

I thought of using:

np.corrcoef(data)

But that computes Pearson correlation only, and I am also getting an error claiming the data is too large.

I tried splitting the work up:

import pandas as pd
import scipy.stats

lst = []
names = For_spearman.columns          # the "names" are the column headers
data = For_spearman.to_numpy()        # shape (n_samples, n_names)
n = data.shape[1]
for i in range(n - 1):
    for j in range(i + 1, n):         # compare each pair of name columns once
        corr, pval = scipy.stats.spearmanr(data[:, i], data[:, j])
        lst.append([names[i], names[j], corr])
# write once after the loop, not on every iteration
pd.DataFrame(lst, columns=['name_a', 'name_b', 'spearman']).to_csv('ForSpearman.csv')

But I just end up with a mess, as I keep getting tangled up in the nd.array somehow. How can I do np.corrcoef's job, but the Spearman way, splitting the work so that it compares one array to another each time?

Answer

There's your problem: you are trying to create a 30000 x 30000 matrix, which alone takes 7.2 GB as float64. 16 GB might not be sufficient once intermediate arrays are counted. One way, though, is to loop. It will be slow, but probably doable on your system:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 30000))

out = pd.DataFrame(index=df.columns, columns=df.columns)

# you can also loop over chunks of columns
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')
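On a small stand-in frame, this loop can be checked against `scipy.stats.spearmanr` directly. This sanity check is not part of the original answer; the 50 x 5 frame is chosen only so that the full matrix fits comfortably in memory:

```python
import numpy as np
import pandas as pd
import scipy.stats

# small stand-in frame so the full matrix fits in memory
df = pd.DataFrame(np.random.rand(50, 5))

out = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')

# scipy builds the whole correlation matrix in one call
# (feasible only at this size, not at 30000 columns)
ref, _ = scipy.stats.spearmanr(df.to_numpy())
assert np.allclose(out.to_numpy(), ref)
```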


Update: the following might have a lower memory requirement:

out = pd.concat([df.corrwith(df[col], method='spearman')
                   .to_frame(name=col) for col in df.columns],
                 axis=1)

Nevertheless, I think 12~16 GB is pretty limited in this case. Also, the looping would take forever.
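Not part of the original answer, but one way to cut both the time and the memory of the loop: Spearman correlation is just Pearson correlation on ranked data, so you can rank each column once, normalise, and build the matrix as plain matrix products over row blocks, writing each block to disk before computing the next. A sketch, assuming no constant columns; the 500 x 40 stand-in frame and the chunk size of 10 are arbitrary choices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 40))  # stand-in for the 500 x 30000 frame

# rank once; Spearman between columns equals Pearson on these ranks
ranks = df.rank().to_numpy()
ranks -= ranks.mean(axis=0)
ranks /= np.linalg.norm(ranks, axis=0)  # assumes no constant (zero-norm) columns

chunk = 10  # tune to available memory
blocks = []
for start in range(0, ranks.shape[1], chunk):
    # each product is a (chunk, n_names) horizontal slice of the Spearman matrix
    blocks.append(ranks[:, start:start + chunk].T @ ranks)
    # at full scale, write each slice to disk here instead of keeping it

corr = np.vstack(blocks)  # the full Spearman matrix, here 40 x 40
```

At full scale you would replace the `blocks` list with a write per slice (e.g. one CSV per chunk), so only one `(chunk, 30000)` block is in memory at a time.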

