Remove level and all of its rows from pandas dataframe if one row meets condition
Problem description
Below is a pandas dataframe that I would like to filter. I would like to remove the year and all of its rows when the temp for at least one row (i.e., visit) in that year is < 37. I am able to remove the specific row in 2014 where the temp is 36; however, I do not know how to make the entire year go away.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# from_product expects a list of iterables, so the labels are wrapped in an outer list
columns = pd.MultiIndex.from_product([['hr', 'temp']], names=['metric'])
data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]], index=index, columns=columns)
data
metric      hr  temp
yr   visit
2013 1      96    38
     2      98    38
2014 1      85    36
     2      84    43
Desired output:
metric      hr  temp
yr   visit
2013 1      96    38
     2      98    38
Recommended answer
You could use groupby/filter to remove groups based on a condition:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
columns = pd.MultiIndex.from_product([['hr', 'temp']], names=['metric'])
data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]], index=index, columns=columns)
print(data.groupby(level='yr').filter(lambda x: (x['temp']>=37).all()))
yields
metric      hr  temp
yr   visit
2013 1      96    38
     2      98    38
Since the rows you wish to remove are grouped by yr and yr is a level of the index, use groupby(level='yr'). For each group, the lambda function is called with x set to the sub-DataFrame for that group. The group is kept when (x['temp']>=37).all() is True.
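To make the per-group evaluation concrete, here is a minimal sketch that applies the same predicate to each group by hand. It assumes flat column labels ('hr', 'temp') instead of the single-level column MultiIndex used above, purely to keep column selection simple; the dictionary name keep is made up for this illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# Assumption: plain column labels rather than the MultiIndex columns of the question
data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]],
                    index=index, columns=['hr', 'temp'])

# Evaluate the predicate that the lambda uses, one group (year) at a time
keep = {}
for yr, group in data.groupby(level='yr'):
    keep[int(yr)] = bool((group['temp'] >= 37).all())

print(keep)  # {2013: True, 2014: False}
```

groupby/filter keeps exactly the groups whose predicate is True, which is why every 2014 row is dropped even though only one visit falls below 37.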
An alternative that builds a boolean mask with groupby/transform, of the form data.loc[(data['temp']>=37).groupby(level='yr').transform('all')], is faster, particularly for large DataFrames, since data['temp']>=37 computes the criterion in a vectorized way for the entire column, whereas in my solution above, (x['temp']>=37).all() computes the criterion in a piecemeal fashion for each sub-DataFrame separately. Generally, vectorized solutions are faster when applied to large arrays or NDFrames instead of in a loop on smaller pieces.
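The vectorized approach can be sketched on the question's data as follows. This is an illustration under one assumption: the columns are plain labels ('hr', 'temp') rather than a single-level MultiIndex, so that data['temp'] is unambiguously a Series. The names ok, mask and result are chosen here for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# Assumption: plain column labels rather than the MultiIndex columns of the question
data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]],
                    index=index, columns=['hr', 'temp'])

# Row-wise criterion, computed once for the whole column
ok = data['temp'] >= 37

# Reduce with all() inside each year, then broadcast the result back to every row
mask = ok.groupby(level='yr').transform('all')

# Keep only rows whose year passed the test for every visit
result = data.loc[mask]
print(result)
```

Because the mask has the same index as data, it can be passed straight to .loc; only the 2013 rows remain.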
Here is an example showing the difference in speed for a 1000-row DataFrame (note that the two timed expressions use different thresholds, 5 and 37, so the comparison is only indicative):
In [70]: df = pd.DataFrame(np.random.randint(100, size=(1000, 4)), columns=list('ABCD')).set_index(['A','B'])
In [71]: %timeit df.groupby(level='A').filter(lambda x: (x['C']>=5).all())
10 loops, best of 3: 46.3 ms per loop
In [72]: %timeit df.loc[(df['C']>=37).groupby(level='A').transform('all')]
100 loops, best of 3: 18.9 ms per loop
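For a comparison outside IPython, a rough sketch using the standard timeit module is shown below. The helper names with_filter and with_transform, the seed, and the shared threshold of 5 are assumptions made for this example; absolute timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
df = pd.DataFrame(rng.integers(100, size=(1000, 4)),
                  columns=list('ABCD')).set_index(['A', 'B'])

def with_filter(frame, threshold=5):
    # Calls the lambda once per group
    return frame.groupby(level='A').filter(lambda x: (x['C'] >= threshold).all())

def with_transform(frame, threshold=5):
    # Computes the criterion once, then broadcasts the per-group result
    mask = (frame['C'] >= threshold).groupby(level='A').transform('all')
    return frame.loc[mask]

# Both approaches should keep the same rows in the same order
same = with_filter(df).equals(with_transform(df))

t_filter = timeit.timeit(lambda: with_filter(df), number=20)
t_transform = timeit.timeit(lambda: with_transform(df), number=20)
print(f"same result: {same}")
print(f"filter:    {t_filter / 20 * 1e3:.2f} ms per call")
print(f"transform: {t_transform / 20 * 1e3:.2f} ms per call")
```

Using the same threshold in both functions keeps the two timings directly comparable.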