numpy unique could not filter out groups with the same value on a specific column


Problem description

I tried to groupby a df and then select the groups whose size is greater than 1 and whose values in a specific column are not all the same:

df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))

The df looks like this. Note that some account_no strings consist of a single space, ext_id and int_id are also strings, and amount is a float:

account_no    ext_id    amount        int_id
              2665057   439.504062     D000192
              2665057   439.504062     D000192
              353724    2758.92        952
              353724    2758.92        952

The code is supposed to return an empty df, since none of the groups in the sample satisfy the conditions, but the rows with int_id = 952 are still returned. How can I fix this?

P.S. numpy 1.14.3, pandas 0.22.0, Python 3.5.2

Recommended answer

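In my opinion the problem is leading or trailing whitespace (or something similar) in int_id: ' 952' and '952' look identical when printed, but np.unique treats them as two different values.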

You can check it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'account_no': ['a', 'a', 'a', 'a'], 
                   'ext_id': [2665057, 2665057, 353724, 353724], 
                   'amount': [439.50406200000003, 439.50406200000003, 2758.92, 2758.92], 
                   'int_id': ['D000192', 'D000192', ' 952', '952']})
print (df)
  account_no       amount   ext_id   int_id
0          a   439.504062  2665057  D000192
1          a   439.504062  2665057  D000192
2          a  2758.920000   353724      952
3          a  2758.920000   353724      952

df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
  account_no   amount  ext_id int_id
2          a  2758.92  353724    952
3          a  2758.92  353724    952

print (df1['int_id'].tolist())
[' 952', '952']
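
The check above can also be done directly on the column. The following is only a minimal sketch, assuming the stray characters are ordinary whitespace; the names has_padding and padded are made up for illustration and do not come from the original answer. It compares each value with its stripped version and lists the values that differ:

```python
import pandas as pd

# Same int_id values as in the sample above (one has a leading space)
df = pd.DataFrame({'int_id': ['D000192', 'D000192', ' 952', '952']})

# A value that changes when stripped has leading or trailing whitespace
has_padding = df['int_id'] != df['int_id'].str.strip()

# Hypothetical helper list: the offending values, quotes make the space visible
padded = df.loc[has_padding, 'int_id'].tolist()
print(padded)  # prints [' 952']
```

Printing the values as a list (or with repr) is what makes the padding visible, because the normal DataFrame display right-aligns the column and hides it.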

Then remove it with str.strip:

df['int_id'] = df['int_id'].str.strip()
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
Empty DataFrame
Columns: [account_no, amount, ext_id, int_id]
Index: []
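
If you would rather not overwrite the original column, the same selection can be expressed as a boolean mask. This is a sketch of an alternative, not the method from the answer above: it assumes the only problem is surrounding whitespace, and the names keys, clean, grouped, mask and result are illustrative. It strips int_id on a copy and uses groupby transform with 'size' and 'nunique' instead of filter with np.unique:

```python
import pandas as pd

df = pd.DataFrame({
    'account_no': ['a', 'a', 'a', 'a'],
    'ext_id': [2665057, 2665057, 353724, 353724],
    'amount': [439.504062, 439.504062, 2758.92, 2758.92],
    'int_id': ['D000192', 'D000192', ' 952', '952'],
})

keys = ['account_no', 'ext_id', 'amount']

# Work on a copy with int_id stripped; df itself is left unchanged
clean = df.assign(int_id=df['int_id'].str.strip())

# Per-row group size and number of distinct int_id values in the group
grouped = clean.groupby(keys)['int_id']
mask = (grouped.transform('size') > 1) & (grouped.transform('nunique') > 1)

# Keep only rows from groups with more than one row and differing int_id
result = clean[mask]
print(result.empty)  # prints True
```

Both groups in the sample have two rows but only one distinct int_id once the whitespace is stripped, so nothing is selected.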
