Pandas: DataFrame filtering using groupby and a function

Published: 2020/5/24 0:30:43 · Tags: python, python-3.x, pandas


Problem description

Using Python 3.3 and Pandas 0.10

I have a DataFrame that is built by concatenating multiple CSV files. First, I filter out all values in the Name column that contain a certain string. The result looks something like this (shortened for brevity; there are actually more columns):

Name    ID
'A'     1
'B'     2
'C'     3
'C'     3
'E'     4
'F'     4
...     ...

Now my issue is that I want to remove a special case of 'duplicate' values. I want to remove every row (the entire row, not just the ID) whose ID maps to more than one distinct Name value. In the example above I would like to keep the rows with ID 1, 2 and 3. For ID=4 the Name values are unequal, so I want to remove those rows.

I tried to use the following line of code (based on the suggestion here: Python Pandas: remove entries based on the number of occurrences).

Code:

df[df.groupby('ID').apply(lambda g: len({x for x in g['Name']})) == 1]

However that gives me the error: ValueError: Item wrong length 51906 instead of 109565!

Edit:

Instead of using apply() I have also tried using transform(), however that gives me the error: AttributeError: 'int' object has no attribute 'ndim'. An explanation on why the error is different per function is very much appreciated!

Also, I want to keep all rows where ID = 3 in the above example.

Thanks in advance, Matthijs

Solution

Instead of length len, I think you want to consider the number of unique values of Name in each group. Use nunique(), and check out this neat recipe for filtering groups.

df[df.groupby('ID').Name.transform(lambda x: x.nunique() == 1).astype('bool')]
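To illustrate why transform() works as a row mask where apply() does not, here is a minimal sketch on the six example rows from the question. It assumes a recent pandas release rather than 0.10, and the variable names (per_group, mask, kept) are only illustrative. The point is that apply() returns one value per group, while transform() returns one value per row, which matches the two lengths reported in the ValueError.

```python
import pandas as pd

# Sample data mirroring the question's example (the real frame has more columns).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# apply() produces one value per group, indexed by ID (4 entries here),
# so it cannot be used directly as a boolean mask for the 6-row frame.
per_group = df.groupby("ID")["Name"].apply(lambda s: s.nunique())

# transform() broadcasts each group's result back onto that group's rows,
# so the mask has one entry per row and aligns with df's index.
mask = df.groupby("ID")["Name"].transform("nunique") == 1

kept = df[mask]
print(len(per_group), len(mask))
print(kept)
```

In this sketch the comparison already yields a boolean Series, so the astype('bool') cast from the one-liner above is not needed; whether it is needed on older pandas versions depends on the dtype that transform() returns there.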

If you upgrade to pandas 0.12, you can use the new filter method on groups, which makes this more succinct and straightforward.

df.groupby('ID').filter(lambda x: x.Name.nunique() == 1)
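As a quick check of the filter() approach, the sketch below runs it on the example rows from the question. It assumes pandas 0.12 or later; the sample frame and the name kept are illustrative only.

```python
import pandas as pd

# Sample data mirroring the question's example.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True; rows of the other groups are dropped.
kept = df.groupby("ID").filter(lambda g: g["Name"].nunique() == 1)

print(kept)
```

Both rows with ID 3 survive because they share a single Name, while both rows with ID 4 are dropped.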

A general remark: Sometimes, of course, you do want to know the length of the group, but I find that size is a safer choice than len, which has been troublesome for me in some cases.
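The answer does not say which cases were troublesome, so the following is only one plausible illustration of how len and size can differ, assuming a recent pandas version: len() on the grouped object counts groups, whereas size() reports the number of rows in each group.

```python
import pandas as pd

# Sample data mirroring the question's example.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "E", "F"],
    "ID": [1, 2, 3, 3, 4, 4],
})

grouped = df.groupby("ID")

# len() of the GroupBy object is the number of groups, not a per-group length.
n_groups = len(grouped)

# size() returns a Series indexed by ID with the row count of each group.
sizes = grouped.size()

print(n_groups)
print(sizes)
```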
