Count rows by certain combinations of row values in pandas
Question
I have a dataframe (df) like this:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
0 -5 -2.5
0 -5 -1.25
0 -5 -1.875
Rows belong to the same chunk if they share the same v1 and v2 values. In this case the chunks are the rows with (v1, v2) equal to [0, -30], [0, -10], and [0, -5]. I want to split the rows into chunks and count the number of rows in each chunk. If a chunk does not contain exactly 6 rows, remove the whole chunk; otherwise, keep it.
My rough code:
v1_ls = df.v1.unique()
v2_ls = df.v2.unique()
for i, j in v1_ls, v2_ls:
    chunk[i] = df[(df['v1'] == v1_ls[i]) & df['v2'] == v2_ls[j]]
    if len(chunk[i]) != 6:
        df = df[df != chunk[i]]
    else:
        pass
Expected output:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
Thanks!
Recommended answer
Assuming there are no NaNs in v1 and v2, use transform + size:
df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
v1 v2 v3
0 0 -30 -15.00000
1 0 -30 -7.50000
2 0 -30 -11.25000
3 0 -30 -13.12500
4 0 -30 -14.06250
5 0 -30 -13.59375
6 0 -10 -5.00000
7 0 -10 -7.50000
8 0 -10 -6.25000
9 0 -10 -5.62500
10 0 -10 -5.93750
11 0 -10 -6.09375
Details:
print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 False
13 False
14 False
Name: v2, dtype: bool
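To reproduce the mask above end to end, here is a minimal, self-contained sketch. It rebuilds the question's sample data by hand and applies the same transform('size') filter; the names group_size, mask, and kept are illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data from the question: three (v1, v2) chunks with 6, 6 and 3 rows.
df = pd.DataFrame({
    "v1": [0] * 15,
    "v2": [-30] * 6 + [-10] * 6 + [-5] * 3,
    "v3": [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})

# transform('size') broadcasts each group's row count back onto every row,
# so the result is aligned with df and can be compared to build a boolean mask.
group_size = df.groupby(["v1", "v2"])["v2"].transform("size")
mask = group_size == 6

# Keep only the rows whose (v1, v2) chunk has exactly 6 rows.
kept = df[mask]
print(kept)
```

Because the mask has the same index as df, no merge or loop is needed: the 3-row chunk with v2 == -5 is dropped in a single boolean selection.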
Unfortunately, filter is really slow, so if you need better performance, use transform:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl')
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['v1', 'v2']).reset_index(drop=True)
print(df.head(10))
In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop
In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop
In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop
N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))
In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop
In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop
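The timings compare two approaches that should select the same rows. As a sanity check, the sketch below runs both on a tiny hand-built frame and confirms they agree. The frame `small`, the group sizes, and the `target` value are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical group sizes: two groups with 6 rows, one with 3, one with 7.
sizes = {("a", 1): 6, ("a", 2): 3, ("b", 1): 6, ("b", 2): 7}
rows = [(v1, v2) for (v1, v2), n in sizes.items() for _ in range(n)]
small = pd.DataFrame(rows, columns=["v1", "v2"])
small["value"] = range(len(small))

target = 6  # keep only groups with exactly this many rows

# Approach 1: groupby.filter calls the Python lambda once per group.
via_filter = small.groupby(["v1", "v2"]).filter(lambda g: len(g) == target)

# Approach 2: transform('size') computes all group sizes in one pass,
# then a boolean mask selects the rows.
sizes_per_row = small.groupby(["v1", "v2"])["v2"].transform("size")
via_transform = small[sizes_per_row == target]

print(via_filter.equals(via_transform))
```

The per-group Python call in filter is a plausible reason it slows down as the number of groups grows, whereas the transform version does the counting without a per-group callback.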
Caveat
These results do not show how performance varies with the number of groups, which affects the timings a lot for some of these solutions.