Create a DataFrame from a dict of unequal-length lists


Question

I have a dict of lists of varying lengths, and I am looking for an efficient way to create a DataFrame from it.
Assume I know the minimum list length, so the longer lists can be truncated when the DataFrame is created.
Here is my dummy code:

data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3

The dictionary can have 10k or 20k keys, so I am looking for an efficient way to create a DataFrame like the one below:

>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67

Recommended answer

You can truncate the values in a dict comprehension; the DataFrame constructor then works as expected:

import pandas as pd

# Truncate each list to min_length before building the DataFrame
print({k: v[:min_length] for k, v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}


df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
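
The answer takes min_length as given. If it is not known in advance, it can be derived from the data itself. The following is a minimal sketch under that assumption, reusing the question's data_dict; it is not part of the original answer.

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}

# Assumption: min_length is not supplied, so take the length of the shortest list
min_length = min(len(v) for v in data_dict.values())

# Every list is cut down to the shortest length, so all columns line up
df = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})
print(df)
```

Computing the minimum costs one extra pass over the values, which is cheap next to building the DataFrame.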

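If some lists may be shorter than min_length, wrap each value in a Series. pandas aligns the Series on their index and fills the missing positions with NaN, which is why column b becomes float below: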

data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3

df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
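
The NaN padding turns column b into floats. If integer columns matter, one option is pandas' nullable "Int64" dtype. This is a sketch of an alternative that the original answer does not use, and it assumes a pandas version that supports nullable integers (0.24 or later).

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# Assumption: the nullable "Int64" dtype is acceptable for downstream code.
# Missing positions are then stored as <NA> rather than forcing a float column.
df = pd.DataFrame(
    {k: pd.Series(v[:min_length], dtype="Int64") for k, v in data_dict.items()}
)
print(df)
print(df.dtypes)
```

The extra dtype conversion adds some per-column overhead, so with many thousands of keys it is worth timing before adopting it.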

Timings:

In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop

#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop

Setup for the timings:

import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000

data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
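
The timings above use IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The sizes below are scaled down and the function names are hypothetical, chosen only for this illustration; the numbers it prints will not match the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
keys = list('ABCDEFGH')
N = 50_000          # assumption: smaller than the original 500000 to keep the run short
min_length = 1_000  # assumption: scaled down from the original 10000

# Assumption: each array gets at least min_length items so the plain variant cannot fail
data_dict = {k: np.random.randint(10, size=np.random.randint(min_length, N)) for k in keys}

def truncate_plain():
    # Slice each array, then let the DataFrame constructor build the columns
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})

def truncate_series():
    # Same slicing, but wrapped in Series so shorter inputs would be NaN-padded
    return pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})

for fn in (truncate_plain, truncate_series):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```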
