The right way to select multiple cross-sections on a DataFrame
Problem description
I have a MultiIndex DataFrame on which I am selecting interesting cross-sections. The code works, but is slow on large datasets which makes me think I'm doing something wrong. Essentially I have been concatenating multiple cross-sections into a new DataFrame, and I am looking for a better way.
import pandas as pd
import numpy as np
import itertools
# setup dataset
event = ['event0', 'event1', 'event2']
node = ['n0', 'n1', 'n2', 'n3']
config = ['a', 'b']
data = []
for x in itertools.product(*[event, node, config]):
data.append([x[0], x[1], x[2], np.random.randn()])
df = pd.DataFrame(data, columns=['event', 'node', 'config', 'value'])
dfi = df.set_index(['event', 'node'])
print(dfi.head(n=12))
Which looks like this:
             config     value
event  node
event0 n0         a  1.256259
       n0         b  0.612465
       n1         a  1.593518
       n1         b -0.747131
       n2         a  0.719973
       n2         b  1.063480
       n3         a -0.943120
       n3         b  2.021804
event1 n0         a -1.427104
       n0         b -0.440886
       n1         a  0.168212
       n1         b -1.084987
Some Analysis
I do some analysis which gives me a list of indexes that I care about:
# Find interesting (event,node)
g = df.groupby(['event', 'node'])['value']
gmin = g.min()
idxs = gmin[(gmin<-1.2)].index
print(idxs)
#idxs = [(u'event1', u'n0'), (u'event1', u'n2'), (u'event2', u'n0')]
And the clumsy cross-sections
Now I just care about the interesting (event, node) combinations. This is the part which is slow on real data sets. Each .xs might take 100ms, but they add up:
df2 = pd.concat([dfi.xs(idx) for idx in idxs])
print(df2)
Which gives the value for every configuration of the interesting (event, node) cross section:
             config     value
event  node
event1 n0         a -1.427104
       n0         b -0.440886
       n2         a  0.273871
       n2         b -1.224801
event2 n0         a -1.297496
       n0         b -1.087568
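One way to avoid calling .xs once per pair is to build a single boolean mask over the whole index. The sketch below is only an illustration under stated assumptions: it rebuilds a dataset with the same shape as the one above but uses evenly spaced stand-in values (not the question's random data) so the result is repeatable, and it has not been timed against the real dataset.

```python
import itertools

import numpy as np
import pandas as pd

# Same shape as the question's dataset; the values are deterministic
# stand-ins (an assumption for repeatability), not the original random data.
event = ['event0', 'event1', 'event2']
node = ['n0', 'n1', 'n2', 'n3']
config = ['a', 'b']
keys = list(itertools.product(event, node, config))
values = np.linspace(-2.0, 2.0, len(keys))
df = pd.DataFrame(
    [(e, n, c, v) for (e, n, c), v in zip(keys, values)],
    columns=['event', 'node', 'config', 'value'],
)
dfi = df.set_index(['event', 'node'])

# Interesting (event, node) pairs, found the same way as in the question.
gmin = df.groupby(['event', 'node'])['value'].min()
idxs = gmin[gmin < -1.2].index

# One membership test over the whole MultiIndex instead of one .xs per pair.
df2 = dfi[dfi.index.isin(idxs)]
print(df2)
```

Because the mask is applied to dfi directly, the rows keep their original order and the (event, node) index is preserved.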
References
- A similar question suggests using a Panel. I could not work out the right indexing to make that work.
Answer
In [11]: g = df.groupby(['event', 'node'])
In [12]: g.filter(lambda x: x['value'].min() < -1.2)
Out[12]:
     event node config     value
0   event0   n0      a -1.566442
1   event0   n0      b -1.652915
14  event1   n3      a  1.685070
15  event1   n3      b -3.205499
20  event2   n2      a -3.007079
21  event2   n2      b  0.159409
(My numbers are different because they are randomly generated!)
You can then set the index to event and node to get your desired result:
In [13]: g.filter(lambda x: x['value'].min() < -1.2).set_index(['event', 'node'])
Out[13]:
             config     value
event  node
event0 n0         a -1.566442
       n0         b -1.652915
event1 n3         a  1.685070
       n3         b -3.205499
event2 n2         a -3.007079
       n2         b  0.159409
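If the Python-level lambda in filter turns out to be a bottleneck on a large frame, a possible variant is to broadcast each group's minimum back onto its rows with transform and compare once. This is a sketch rather than part of the answer above; the evenly spaced values are stand-ins chosen for repeatability, and whether it is actually faster would need to be measured on the real data.

```python
import itertools

import numpy as np
import pandas as pd

# Same shape as the question's dataset; deterministic stand-in values
# (an assumption for repeatability), not the original random data.
event = ['event0', 'event1', 'event2']
node = ['n0', 'n1', 'n2', 'n3']
config = ['a', 'b']
keys = list(itertools.product(event, node, config))
values = np.linspace(-2.0, 2.0, len(keys))
df = pd.DataFrame(
    [(e, n, c, v) for (e, n, c), v in zip(keys, values)],
    columns=['event', 'node', 'config', 'value'],
)

# transform('min') returns a Series aligned with df, holding each row's group minimum.
group_min = df.groupby(['event', 'node'])['value'].transform('min')

# Keep every row whose (event, node) group dips below the threshold.
result = df[group_min < -1.2].set_index(['event', 'node'])
print(result)
```

The selection should match what g.filter(lambda x: x['value'].min() < -1.2) returns on the same frame, since both keep whole groups whose minimum is below the threshold.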