How to create a large but sparse dataframe from a dict efficiently?


Question


I have a large but very sparse matrix (50,000 rows × 100,000 columns; only 10% of the values are known). Each known element of this matrix is a floating-point number from 0.00 to 1.00, and these known values are stored in a Python dict with a format like:

{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }

Now the problem is how to construct a pandas.DataFrame from this dict efficiently. Here, efficiency covers both memory usage and the time needed to build the DataFrame.

For memory usage, I'm hoping to store each element as an np.uint8. Because the known values range from 0.00 to 1.00 and I only care about the first two decimal digits, I could cast each one to an unsigned 8-bit integer by multiplying by 100. This might save a lot of memory for this DataFrame.
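The casting idea above can be sketched like this (just an illustration with made-up values, not code from the question):

```python
import numpy as np

# Known float values in [0.00, 1.00]; only the first two decimal
# digits matter, so scale by 100 and round into an unsigned byte.
values = np.array([0.27, 0.45, 0.65, 0.87])
packed = np.rint(values * 100).astype(np.uint8)

print(packed)         # [27 45 65 87]
print(packed.nbytes)  # 4 bytes, versus 32 for the float64 original
```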

Is there any way to do this?

Solution

A dict like:

{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }

... is better converted into a normalised structure like this:

 level0    level1   value
 c1        r1        0.27
 c1        r3        0.45
 c2        r2        0.65
 c2        r4        0.87

... than into a pivot table like this:

      r1    r2    r3    r4
c1  0.27   nan  0.45   nan
c2   nan  0.65   nan  0.87

... since the latter takes much more memory.
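A rough back-of-envelope check on the sizes involved makes this concrete (the byte counts below are assumptions: float64 cells for the pivot, and two 4-byte integer-coded index columns plus a uint8 value per row for the normalised frame):

```python
rows, cols, density = 50_000, 100_000, 0.10

# Dense pivot: every cell materialised as a float64 (8 bytes),
# including the 90% that are NaN.
pivot_bytes = rows * cols * 8

# Normalised frame: one record per known value only.
known = int(rows * cols * density)
normalised_bytes = known * (4 + 4 + 1)

print(pivot_bytes / 1e9)       # 40.0 (GB)
print(normalised_bytes / 1e9)  # 4.5 (GB)
```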

A reasonably memory-efficient way of constructing the normalised structure is:

import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

result = []
for key, value in data.items():  # .iteritems() is Python 2 only
    row = pd.Series(value).reset_index()
    row.insert(0, 'key', key)
    result.append(row)

pd.concat(result, ignore_index=True)

This results in (dicts preserve insertion order in Python 3.7+, so c1 comes first):

  key index     0
0  c1    r1  0.27
1  c1    r3  0.45
2  c2    r2  0.65
3  c2    r4  0.87
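The uint8 idea from the question can then be applied to the value column of this result (a sketch; the categorical step for the label columns is an extra assumption, not part of the original answer):

```python
import pandas as pd

# Same shape as the normalised result above.
df = pd.DataFrame({'key': ['c1', 'c1', 'c2', 'c2'],
                   'index': ['r1', 'r3', 'r2', 'r4'],
                   0: [0.27, 0.45, 0.65, 0.87]})

# Keep two decimal digits: scale by 100, round, store as one byte
# per entry instead of eight.
df[0] = (df[0] * 100).round().astype('uint8')

# Categoricals replace the repeated row/column label strings
# with small integer codes, shrinking the label columns too.
df['key'] = df['key'].astype('category')
df['index'] = df['index'].astype('category')

print(df.dtypes)
```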

