Efficient way to do a Pandas dataframe column lookup
Question
I have the following dataframe:
+-------------------------------------------+----------------------------------------+----------------+----------------------------------+
| Lookup | LookUp Value 1 | LookUp Value 2 | LookUp Value 3 |
+-------------------------------------------+----------------------------------------+----------------+----------------------------------+
| 300000,50000,500000,100000,1000000,200000 | -1820,-1820,-1820,-1820,-1820,-1820 | 1,1,1,1,1,1 | 1820,1820,1820,1820,1820,1820 |
| 100000,1000000,200000,300000,50000,500000 | -1360,-28760,-1360,-28760,-1360,-28760 | 2,3,2,3,2,3 | 4120,31520,4120,31520,4120,31520 |
+-------------------------------------------+----------------------------------------+----------------+----------------------------------+
Each column holds a comma-separated list: the first column contains the lookup keys and the remaining columns contain the lookup values. I would like to generate a dataframe like this:
+--------------------+--------------------+--------------------+
| Lookup_300K_Value1 | Lookup_300K_Value2 | Lookup_300K_Value3 |
+--------------------+--------------------+--------------------+
| -1820 | 1 | 1820 |
| -28760 | 3 | 31520 |
+--------------------+--------------------+--------------------+
I already have a solution that uses pandas.apply and processes the data row by row, but it is very slow. Is there a way to speed this up? Thank you very much.
I have added the code that generates the dataframe below:
import pandas as pd

d = {'Lookup_Key': ['300000,50000,500000,100000,1000000,200000', '100000,1000000,200000,300000,50000,500000'],
     'LookUp_Value_1': ['-1820,-1820,-1820,-1820,-1820,-1820', '-1360,-28760,-1360,-28760,-1360,-28760'],
     'LookUp_Value_2': ['1,1,1,1,1,1', '2,3,2,3,2,3'],
     'LookUp_Value_3': ['1820,1820,1820,1820,1820,1820', '4120,31520,4120,31520,4120,31520']}
df = pd.DataFrame(data=d)
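Answer
The following solution was tested with missing values in some of the value columns; it assumes that the Lookup column itself contains no NaN or None values. The code uses the column name Lookup from the table in the question; with the generation code above, replace it with Lookup_Key.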
df = pd.concat([df[x].str.split(',', expand=True).stack() for x in df.columns], axis=1, keys=df.columns)
df = df.reset_index(level=1, drop=True).set_index('Lookup', append=True).unstack().sort_index(axis=1, level=1)
df.columns = [f'{b}_{a}' for a, b in df.columns]
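To make the intermediate steps easier to follow, here is a minimal sketch of the same split/stack approach on a small made-up frame. The sample data (two keys per row, one missing value cell) is an assumption for illustration and is not taken from the question; the variable names parts, long and wide are likewise hypothetical.

```python
import pandas as pd

# Hypothetical sample data: the second row has no values for "LookUp Value 1".
df = pd.DataFrame({
    "Lookup": ["300000,50000", "50000,300000"],
    "LookUp Value 1": ["-1820,-1360", None],
    "LookUp Value 2": ["1,2", "3,4"],
})

# Split every column into one cell per list item, then stack into a Series
# indexed by (original row, position in the list).
parts = [df[col].str.split(",", expand=True).stack() for col in df.columns]
long = pd.concat(parts, axis=1, keys=df.columns)

# Drop the position level, index by (row, lookup key), pivot keys to columns.
wide = (
    long.reset_index(level=1, drop=True)
        .set_index("Lookup", append=True)
        .unstack()
        .sort_index(axis=1, level=1)
)
# Flatten the (value name, key) column pairs into "key_name" labels.
wide.columns = [f"{key}_{name}" for name, key in wide.columns]
print(wide)
```

Because the stacked Series are aligned on a unique (row, position) index, a missing cell simply shows up as a missing value in the result instead of breaking the concatenation. Note that the values stay strings unless they are converted afterwards.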
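If there are no missing values, a shorter variant is possible. The idea is to split each column's values in a loop, explode the resulting lists into a Series, concatenate the Series, and finally reshape with unstack. This variant relies on every column having the same number of items in each row, so that the exploded Series share an identical index: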
df = pd.concat([df[x].str.split(',').explode() for x in df.columns], axis=1)
df = df.set_index('Lookup', append=True).unstack().sort_index(axis=1, level=1)
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
100000_LookUp Value 1 100000_LookUp Value 2 100000_LookUp Value 3 \
0 -1820 1 1820
1 -1360 2 4120
1000000_LookUp Value 1 1000000_LookUp Value 2 1000000_LookUp Value 3 \
0 -1820 1 1820
1 -28760 3 31520
200000_LookUp Value 1 200000_LookUp Value 2 200000_LookUp Value 3 \
0 -1820 1 1820
1 -1360 2 4120
300000_LookUp Value 1 300000_LookUp Value 2 300000_LookUp Value 3 \
0 -1820 1 1820
1 -28760 3 31520
50000_LookUp Value 1 50000_LookUp Value 2 50000_LookUp Value 3 \
0 -1820 1 1820
1 -1360 2 4120
500000_LookUp Value 1 500000_LookUp Value 2 500000_LookUp Value 3
0 -1820 1 1820
1 -28760 3 31520
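The question only asks for the values belonging to key 300000. A possible follow-up step, sketched here under the assumption that the column names match the table in the question (Lookup, LookUp Value 1, ...), is to filter the reshaped columns by their key prefix and convert the strings to integers. The names wide, key and result are made up for this example.

```python
import pandas as pd

# Data from the question, using the column names shown in its table.
df = pd.DataFrame({
    "Lookup": ["300000,50000,500000,100000,1000000,200000",
               "100000,1000000,200000,300000,50000,500000"],
    "LookUp Value 1": ["-1820,-1820,-1820,-1820,-1820,-1820",
                       "-1360,-28760,-1360,-28760,-1360,-28760"],
    "LookUp Value 2": ["1,1,1,1,1,1", "2,3,2,3,2,3"],
    "LookUp Value 3": ["1820,1820,1820,1820,1820,1820",
                       "4120,31520,4120,31520,4120,31520"],
})

# Same reshaping as in the answer: one row per (original row, lookup key).
wide = pd.concat([df[c].str.split(",").explode() for c in df.columns], axis=1)
wide = wide.set_index("Lookup", append=True).unstack().sort_index(axis=1, level=1)
wide.columns = [f"{key}_{name}" for name, key in wide.columns]

# Keep only the columns for one key and convert the string values to integers.
key = "300000"
result = wide.filter(regex=f"^{key}_").astype(int)
print(result)
```

The anchored regular expression keeps a key such as 1300000 from matching by accident. The resulting rows hold -1820, 1, 1820 and -28760, 3, 31520, which agrees with the desired output in the question.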