Count rows by a certain combination of row values in pandas

Question

I have a dataframe (df) like this:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375
   0    -5  -2.5
   0    -5  -1.25
   0    -5  -1.875

Rows belong to the same chunk if they share the same v1 and v2 values. In this case the chunks are the rows with (v1, v2) equal to (0, -30), (0, -10), and (0, -5). I want to split the rows into chunks and count the number of rows in each chunk. If a chunk does not contain exactly 6 rows, the whole chunk should be removed; otherwise it should be kept.

My rough code:

v1_ls = df.v1.unique()
v2_ls = df.v2.unique()
for i, j in v1_ls, v2_ls: 
   chunk[i] = df[(df['v1'] == v1_ls[i]) & df['v2'] == v2_ls[j]]

   if len(chunk[i])!= 6:
      df = df[df != chunk[i]]
   else:
      pass

Expected output:

  v1    v2  v3
   0    -30 -15
   0    -30 -7.5
   0    -30 -11.25
   0    -30 -13.125
   0    -30 -14.0625
   0    -30 -13.59375
   0    -10 -5
   0    -10 -7.5
   0    -10 -6.25
   0    -10 -5.625
   0    -10 -5.9375
   0    -10 -6.09375

Thanks!

Recommended answer

I assume there are no NaNs in v1 and v2, so use transform + size:

df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
    v1  v2        v3
0    0 -30 -15.00000
1    0 -30  -7.50000
2    0 -30 -11.25000
3    0 -30 -13.12500
4    0 -30 -14.06250
5    0 -30 -13.59375
6    0 -10  -5.00000
7    0 -10  -7.50000
8    0 -10  -6.25000
9    0 -10  -5.62500
10   0 -10  -5.93750
11   0 -10  -6.09375

Details:

print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14    False
Name: v2, dtype: bool
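
For readers who want to reproduce the result, here is a minimal self-contained sketch. The frame is rebuilt by hand from the values in the question, and the variable names `group_size` and `kept` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "v1": [0] * 15,
    "v2": [-30] * 6 + [-10] * 6 + [-5] * 3,
    "v3": [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})

# Size of each (v1, v2) group, broadcast back onto every row of that group
group_size = df.groupby(["v1", "v2"])["v2"].transform("size")

# Keep only the rows whose group has exactly 6 rows
kept = df[group_size == 6]
print(kept)
```

Because transform returns a Series aligned with the original index, the comparison can be used directly as a boolean mask.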

Unfortunately, filter is really slow, so if you need better performance, use transform:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl') 
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2':np.random.randint(10000,size=N),
                   'value':np.random.randint(1000,size=N),
                   'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['v1','v2']).reset_index(drop=True)
print (df.head(10))

In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop

In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop

In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop
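
The timings only matter if both approaches select the same rows. The sketch below checks that on a small random frame. The frame shape, the seed, and the names `small`, `by_filter`, and `by_transform` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small random frame with the same column layout as the benchmark above
rng = np.random.default_rng(0)
small = pd.DataFrame({
    "v1": rng.choice(list("abc"), 200),
    "v2": rng.integers(0, 10, size=200),
    "value": rng.integers(0, 1000, size=200),
})

# Approach 1: call a Python function once per group
by_filter = small.groupby(["v1", "v2"]).filter(lambda g: len(g) == 6)

# Approach 2: compute group sizes in one pass and use them as a mask
mask = small.groupby(["v1", "v2"])["v2"].transform("size") == 6
by_transform = small[mask]

# Both should keep the same rows in the same order
print(by_filter.index.equals(by_transform.index))
```

The filter version runs a Python lambda for every group, which is a likely reason it slows down as the number of groups grows, while the transform version stays in vectorized code.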

N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A=np.random.randint(0, ngroups, size=N), B=np.random.randn(N)))

In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop

In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop

Caveat

These results do not account for the number of groups, which strongly affects the timings of some of these solutions.
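
To see how the group count changes the picture on your own machine, a rough harness like the one below may help. The helper `time_both`, the row count, the group counts, and the size threshold are all hypothetical choices for illustration, and the numbers it prints will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_both(n_rows, n_groups, repeat=3):
    """Return best-of-`repeat` timings (seconds) for filter and transform."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "A": rng.integers(0, n_groups, size=n_rows),
        "B": rng.standard_normal(n_rows),
    })
    # Keep groups larger than the average group size
    threshold = n_rows // n_groups

    t_filter = min(timeit.repeat(
        lambda: df.groupby("A").filter(lambda g: len(g) > threshold),
        number=1, repeat=repeat))
    t_transform = min(timeit.repeat(
        lambda: df[df.groupby("A")["A"].transform("size") > threshold],
        number=1, repeat=repeat))
    return t_filter, t_transform


# Same number of rows, increasing number of groups
for n_groups in (10, 1_000, 5_000):
    t_filter, t_transform = time_both(50_000, n_groups)
    print(f"{n_groups:>5} groups: filter {t_filter:.4f}s, "
          f"transform {t_transform:.4f}s")
```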
