Converting large dataframe to nd.array, doing spearman corr
Question
I have a large DataFrame with samples as the index and names as the header (500 x 30000), e.g.:
         Name1   Name2   Name3
Sample1  232.12  0.239   -0.324
Sample2  0.928   23.213  -0.056
Sample3  -0.231  7.7776  -0.984
What I want to get:
       Name1     Name2     Name3
Name1  1         0.001     corr val
Name2  corr val  1         corr val
Name3  corr val  corr val  1
etc.
I thought of:
np.corrcoef(data)
But that computes Pearson only, and it also raises an error complaining that the data is too large.
I tried to split it up:
import scipy.stats
import pandas as pd

lst = []
data = For_spearman.to_numpy()
#data = np.delete(data, (0), axis=0)
data_size = len(data) - 1
for key1 in range(1, data_size):  # Ignoring first column, which is the index
    if key1 != data_size - 1:  # Can't compare past the last row, so -1 and -1
        for key2 in range(key1 + 1, data_size):  # Comparing name1 vs name2
            test = scipy.stats.spearmanr(data[key1][1:], data[key2][1:])
            lst.append([data[key1][0], data[key2][0], test])
pd.DataFrame(lst).to_csv('ForSpearman.csv')
But I just get a mess, as I always end up tangled in the nd.array somehow. How can I do what `np.corrcoef` does, but the "spearman" way, splitting the work so that one array is compared to another array each time?
Answer
There's your problem: you are trying to create a 30000 x 30000 matrix, which alone is 7.2 GB (30000 × 30000 × 8 bytes as float64). 16 GB might not be sufficient once intermediate arrays are counted. One way, though, is to loop. It will be slow but probably doable on your system:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 30000))
out = pd.DataFrame(index=df.columns, columns=df.columns)

# you can also loop in chunks of columns
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')
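As a quick sanity check of this loop on a small frame (the column names `a`, `b`, `c` are made up for the example), each entry of `out` should match `scipy.stats.spearmanr` computed on the corresponding pair of columns:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small made-up frame standing in for the 500 x 30000 one
df = pd.DataFrame(np.random.rand(50, 3), columns=['a', 'b', 'c'])

out = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')

# Cross-check one entry against scipy's pairwise spearmanr
rho, _ = stats.spearmanr(df['a'], df['b'])
print(np.isclose(out.loc['a', 'b'], rho))  # True
```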
Update: the following might require less memory:
out = pd.concat([df.corrwith(df[col], method='spearman')
                   .to_frame(name=col) for col in df.columns],
                axis=1)
Nevertheless, I think 12~16 GB is pretty limited in this case. Also, the looping would take forever.
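One further option, not mentioned in the answer above: Spearman correlation is just Pearson correlation computed on ranks, so you can rank every column once with `df.rank()` and then correlate the ranked data a chunk of columns at a time with `np.corrcoef`, which avoids re-ranking inside every `corrwith` call. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.rand(50, 4))

# Rank each column once; Spearman(df) == Pearson(ranked df)
ranked = df.rank()

# Correlate one chunk of columns at a time to bound memory use
chunk = ranked.iloc[:, :2]
partial = np.corrcoef(chunk.to_numpy(), ranked.to_numpy(), rowvar=False)
# partial[:2, 2:] now holds the Spearman correlations of the first
# two columns against all four columns

expected, _ = stats.spearmanr(df[0], df[1])
print(np.isclose(partial[0, 3], expected))  # True
```

Each chunk produces only a `chunk_size x 30000` slab instead of the full 30000 x 30000 matrix, so the chunk size directly controls peak memory.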