Converting large dataframe to nd.array, doing spearman corr
Question
I have a large DataFrame with samples as the index and names as the header (500 x 30000), e.g.:
         Name1   Name2   Name3
Sample1  232.12  0.239   -0.324
Sample2  0.928   23.213  -0.056
Sample3  -0.231  7.7776  -0.984
What I want to get:
       Name1     Name2     Name3
Name1  1         0.001     corr val
Name2  corr val  1         corr val
Name3  corr val  corr val  1
etc.
I thought of:
np.corrcoef(data)
But that computes Pearson only, and it also raises an error complaining that the data is too large.
I tried to split it up:
import scipy.stats
import pandas as pd

lst = []
data = For_spearman.to_numpy()
#data = np.delete(data, (0), axis=0)
data_size = len(data) - 1
for key1 in range(1, data_size):  # Ignoring first column, which is the index
    if key1 != data_size - 1:  # Can't compare past the last row, so -1 and -1
        for key2 in range(key1 + 1, data_size):  # Comparing name1 vs name2
            test = scipy.stats.spearmanr(data[key1][1:], data[key2][1:])
            lst.append([data[key1][0], data[key2][0], test])
pd.DataFrame(lst).to_csv('ForSpearman.csv')
But I just get a mess, as I always end up tangled in the nd.array somehow. How can I do what `np.corrcoef` does, but the "spearman" way, splitting the work so that one array is compared to another array each time?
Answer
There's your problem: you are trying to create a 30000 x 30000 matrix, which alone is 7.2 GB (30000 × 30000 × 8 bytes as float64). 16 GB might not be sufficient once intermediate arrays are counted. One way, though, is to loop. It will be slow but probably doable on your system:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 30000))
out = pd.DataFrame(index=df.columns, columns=df.columns)

# you can also loop in chunks of columns
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')
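As a quick sanity check of this loop on a small frame (the column names `a`, `b`, `c` are made up for the example), each entry of `out` should match `scipy.stats.spearmanr` computed on the corresponding pair of columns:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small made-up frame standing in for the 500 x 30000 one
df = pd.DataFrame(np.random.rand(50, 3), columns=['a', 'b', 'c'])

out = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')

# Cross-check one entry against scipy's pairwise spearmanr
rho, _ = stats.spearmanr(df['a'], df['b'])
print(np.isclose(out.loc['a', 'b'], rho))  # True
```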
Update: the following might require less memory:
out = pd.concat([df.corrwith(df[col], method='spearman')
                   .to_frame(name=col) for col in df.columns],
                axis=1)
Nevertheless, I think 12~16 GB is pretty limited in this case. Also, the looping would take forever.
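One further option, not mentioned in the answer above: Spearman correlation is just Pearson correlation computed on ranks, so you can rank every column once with `df.rank()` and then correlate the ranked data a chunk of columns at a time with `np.corrcoef`, which avoids re-ranking inside every `corrwith` call. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.rand(50, 4))

# Rank each column once; Spearman(df) == Pearson(ranked df)
ranked = df.rank()

# Correlate one chunk of columns at a time to bound memory use
chunk = ranked.iloc[:, :2]
partial = np.corrcoef(chunk.to_numpy(), ranked.to_numpy(), rowvar=False)
# partial[:2, 2:] now holds the Spearman correlations of the first
# two columns against all four columns

expected, _ = stats.spearmanr(df[0], df[1])
print(np.isclose(partial[0, 3], expected))  # True
```

Each chunk produces only a `chunk_size x 30000` slab instead of the full 30000 x 30000 matrix, so the chunk size directly controls peak memory.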