How to create a DataFrame from a dict of unequal-length lists, truncating to a specific length?
Problem description
I have a dict of lists (of variable length), and I am looking for an efficient way to create a DataFrame from it.
Assume I know the minimum list length, so I can truncate the longer lists while creating the DataFrame.
Here is my dummy code:
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3
I can have a dictionary with 10k or 20k keys, so I am looking for an efficient way to create a DataFrame like the one below:
>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
You can slice the values of the dict in a dict comprehension; the DataFrame constructor then works directly:
import pandas as pd

# Slice every list to the first min_length items
print({k: v[:min_length] for k, v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}

df = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})
print(df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
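If the minimum length is not known in advance, it can be derived from the dict itself. The following is only a minimal sketch: `frame_from_truncated` is a hypothetical helper name, not part of pandas, and falling back to the shortest list when no length is given is an assumption that goes beyond the question.

```python
import pandas as pd


def frame_from_truncated(data, length=None):
    """Hypothetical helper: cut every list to `length` items, then build a DataFrame.

    Assumption: when `length` is None, the length of the shortest list is used.
    """
    if length is None:
        length = min((len(v) for v in data.values()), default=0)
    return pd.DataFrame({k: v[:length] for k, v in data.items()})


data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}
df = frame_from_truncated(data_dict)
print(df)
```

Passing a `length` larger than the shortest list leaves the sliced lists unequal, and the DataFrame constructor then raises a `ValueError`.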
If some lists may be shorter than min_length, wrap each value in a Series so that the missing positions are filled with NaN:
data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3
df = pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})
print(df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
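As the output above shows, the padded column is upcast to float so that it can hold NaN. If the values are known to be integers (an assumption, the question does not say so), one possible variation is pandas' nullable `"Int64"` dtype, which keeps integer values and marks the gap as missing. A small sketch comparing the two:

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# Plain Series: the short column 'b' is padded with NaN and becomes float64.
df_float = pd.DataFrame(
    {k: pd.Series(v[:min_length]) for k, v in data_dict.items()}
)

# Assumption: all values are integers, so the nullable "Int64" dtype applies.
df_int = pd.DataFrame(
    {k: pd.Series(v[:min_length], dtype="Int64") for k, v in data_dict.items()}
)

print(df_float.dtypes)
print(df_int.dtypes)
```

The extra Series construction per key is the same overhead that the timings below measure for the Series variant.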
Timings:
In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop
In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop
# Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop
Code for timings:
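import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000
# Eight keys, each mapped to an array of random digits with a random length below N
data_dict = {k: np.random.randint(10, size=np.random.randint(N)) for k in L}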
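Outside IPython, a comparison of the two approaches from this answer could be reproduced with the standard `timeit` module. This is only a rough sketch under stated assumptions: the sizes are much smaller than in the setup above so that it runs quickly, `max_length` and the function names `with_slices` and `with_series` are hypothetical, and every list is generated with at least `min_length` items so that both variants build the same frame.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
min_length = 1_000  # assumption: smaller than the original 10000 for a short run
max_length = 5_000  # hypothetical upper bound on the list length

# Every list has at least min_length items, so no NaN padding is needed.
data_dict = {
    k: rng.integers(10, size=rng.integers(min_length, max_length)).tolist()
    for k in "ABCDEFGH"
}


def with_slices():
    # Slice the raw lists and pass them straight to the constructor.
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})


def with_series():
    # Wrap each sliced list in a Series first.
    return pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})


for func in (with_slices, with_series):
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{func.__name__}: {best * 1e6:.0f} µs per call")
```

Absolute numbers depend on the machine and on the pandas version, so only the relative difference between the two variants is meaningful.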