How to create a DataFrame from a dict of unequal-length lists and truncate to a specific length?

Problem description

I have a dict of lists of variable lengths, and I am looking for an efficient way to create a DataFrame from it.

Assume I know the minimum list length, so I can truncate the longer lists while creating the DataFrame.

Here is my dummy code:

data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3

The dictionary can have 10k or 20k keys, so I am looking for an efficient way to create a DataFrame like the one below:

>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67

Solution

You can truncate the dict values in a dict comprehension; the DataFrame constructor then works without problems:

print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}


df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
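
If the minimum length is not known in advance, it can be derived from the data before slicing. The following is a minimal sketch, assuming the dict is non-empty; the helper name `truncate_to_frame` is hypothetical and not part of pandas.

```python
import pandas as pd


def truncate_to_frame(data_dict, min_length=None):
    """Build a DataFrame from a dict of sequences, truncating each to min_length.

    truncate_to_frame is a hypothetical helper for illustration, not a pandas API.
    If min_length is None, the length of the shortest sequence is used.
    """
    if min_length is None:
        # Assumes data_dict is non-empty; min() raises ValueError otherwise.
        min_length = min(len(v) for v in data_dict.values())
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})


data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}
df = truncate_to_frame(data_dict)
print(df)
```

If an explicit min_length is larger than the shortest list, the sliced columns end up with different lengths and the DataFrame constructor raises a ValueError.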

If some lists may be shorter than min_length, wrap each value in a Series; the missing positions are then filled with NaN:

data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3

df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
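
Column b is shown as floats in the output above because NaN cannot be stored in a plain integer column. If keeping integers matters, one option is the nullable integer dtype. This is only a sketch, assuming a pandas version that supports the "Int64" extension dtype (0.24 or later) and that every value is an integer; the name `df_nullable` is introduced here for illustration.

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# Each Series uses the nullable "Int64" dtype, so a missing position
# becomes <NA> instead of forcing the whole column to float.
df_nullable = pd.DataFrame(
    {k: pd.Series(v[:min_length], dtype="Int64") for k, v in data_dict.items()}
)
print(df_nullable)
print(df_nullable.dtypes)
```

This may cost some speed compared with plain NumPy-backed columns, so it is worth timing on real data before adopting it.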

Timings:

In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop

#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop
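
For context, the last approach timed above pads every list with NaN up to the longest length and only then drops the incomplete rows, which is plausibly why it is so much slower on large inputs. Note that it truncates to the shortest list rather than to an explicit min_length. A small sketch on the sample data from the question (the names `sliced` and `padded` are introduced here for illustration) suggests both routes give the same values in that case:

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# Slicing approach: each list is cut before the frame is built.
sliced = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})

# Padding approach: one row per key, padded with NaN, then transposed;
# dropna() removes every row that contains a missing value.
padded = pd.DataFrame.from_dict(data_dict, orient='index').T.dropna()

# Compare the values only; the padded frame may hold floats instead of ints.
print((sliced.values == padded.values).all())
```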

Code for timings:

import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000

data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
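
The %timeit lines above only work inside IPython. Outside of it, a similar comparison can be run with the standard timeit module. The sketch below rests on several assumptions: the sizes are reduced to keep the run short, the random lengths are bounded below by min_length so that every array can be sliced, and the function names `by_slicing` and `by_series` are hypothetical. Absolute numbers will differ from those reported above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
keys = list('ABCDEFGH')
N = 50000          # smaller than the original 500000 to keep the run short
min_length = 1000  # likewise reduced (assumption for this sketch)

# The lower bound of min_length guarantees every array is long enough to slice.
data_dict = {
    k: np.random.randint(10, size=np.random.randint(min_length, N)) for k in keys
}


def by_slicing():
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})


def by_series():
    return pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})


for func in (by_slicing, by_series):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e6:.0f} µs per call")
```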
