按行值pandas的特定组合计算行数 [英] count rows by certain combination of row values pandas
问题描述
我有一个像这样的数据框 (df):
I have a dataframe (df) like this:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
0 -5 -2.5
0 -5 -1.25
0 -5 -1.875
如果具有某些/相同的 v1
和 v2
,则行在同一块中.在这种情况下,行带有([0,-30], [0,-10], [0,-5])
.我想将行分成块并计算该块中的行数.如果行的长度不是6,则移除整个chunk,否则保留这个chunk.
The rows are in the same chunk if with certain/same v1
and v2
. In this case, rows with([0,-30], [0,-10], [0,-5])
. I want to split the rows in chunks and count the number of rows in this chunk. If the length of the rows is not 6, then remove the whole chunk, otherwise, keep this chunk.
我的粗略代码:
v1_ls = df.v1.unique()
v2_ls = df.v2.unique()
for i, j in v1_ls, v2_ls:
chunk[i] = df[(df['v1'] == v1_ls[i]) & df['v2'] == v2_ls[j]]
if len(chunk[i])!= 6:
df = df[df != chunk[i]]
else:
pass
预期输出:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
谢谢!
推荐答案
我认为在 v1
和 v2
中没有 NaN
,所以使用 transform
+ 尺寸
:
I think in v1
and v2
are no NaN
s, so use transform
+ size
:
df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
v1 v2 v3
0 0 -30 -15.00000
1 0 -30 -7.50000
2 0 -30 -11.25000
3 0 -30 -13.12500
4 0 -30 -14.06250
5 0 -30 -13.59375
6 0 -10 -5.00000
7 0 -10 -7.50000
8 0 -10 -6.25000
9 0 -10 -5.62500
10 0 -10 -5.93750
11 0 -10 -6.09375
详情:
print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 False
13 False
14 False
Name: v2, dtype: bool
不幸的是filter
真的很慢,所以如果需要更好的性能使用transform
:
Unfortunately filter
is really slow, so if need better performance use transform
:
np.random.seed(123)
N = 1000000
L = list('abcdefghijkl')
df = pd.DataFrame({'v1': np.random.choice(L, N),
'v2':np.random.randint(10000,size=N),
'value':np.random.randint(1000,size=N),
'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['v1','v2']).reset_index(drop=True)
print (df.head(10))
In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop
In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop
In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop
<小时>
N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))
In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop
In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop
警告
考虑到组的数量,结果并未解决性能问题,这将对其中一些解决方案的时间产生很大影响.
The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.
这篇关于按行值pandas的特定组合计算行数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!