Remove low frequency values from pandas.DataFrame

Problem description

How can I remove values from a column of a pandas.DataFrame that occur rarely, i.e. with a low frequency? Example:

In [4]: df[col_1].value_counts()

Out[4]: 0       189096
        1       110500
        2        77218
        3        61372
              ...
        2065         1
        2067         1
        1569         1
        dtype: int64

So, my question is: how do I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() looks like this?

UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 only 1 (one) time, and I want to remove values like this.

Recommended answer

I see two ways you might want to do this.

For the entire DataFrame

This method removes the values that occur infrequently across the entire DataFrame (it replaces them with NaN). We can do it without loops, using built-in functions to speed things up.

import pandas as pd
import numpy as np

# Sample data: 100 rows of random integers in [0, 9) in two columns.
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10  # Values that occur this many times or fewer are replaced with NaN.
value_counts = df.stack().value_counts()  # Counts over the entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
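
Because the sample above is random, it is hard to see which values get removed. The following is a minimal sketch of the same idea on a small hand-made frame. The function name `mask_rare_values` and the `demo` data are assumptions made for illustration, and it swaps `replace` for `isin` plus `mask`, which returns a new DataFrame instead of modifying the original. Counting is done on the flattened array rather than with `stack()`; both give counts over the whole frame.

```python
import pandas as pd


def mask_rare_values(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Return a copy of df in which values that occur `threshold` times
    or fewer across the whole frame are replaced with NaN."""
    # Count every value in the frame, regardless of column.
    counts = pd.Series(df.to_numpy().ravel()).value_counts()
    rare = counts[counts <= threshold].index.tolist()
    # mask() replaces cells where the condition is True with NaN.
    return df.mask(df.isin(rare))


# Hypothetical data: 2065 and 2067 each occur exactly once.
demo = pd.DataFrame({"A": [0, 0, 0, 1, 2065],
                     "B": [0, 1, 1, 1, 2067]})
result = mask_rare_values(demo, threshold=1)
print(result)
```

Note that the integer columns become float columns, since NaN cannot be stored in a plain int64 column.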

Column by column

This method removes the entries that occur infrequently within each column, counting each column separately.

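import pandas as pd
import numpy as np

# Sample data: 100 rows of random integers in [0, 9) in two columns.
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10  # Values that occur this many times or fewer are replaced with NaN.
for col in df.columns:
    value_counts = df[col].value_counts()  # Counts within this column only
    to_remove = value_counts[value_counts <= threshold].index
    # Assign back instead of calling replace(..., inplace=True) on df[col],
    # which may act on a temporary copy and leave df unchanged.
    df[col] = df[col].replace(to_remove, np.nan)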
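
To show how the per-column variant differs from the whole-DataFrame one, here is a small sketch on hand-made data. The helper name `mask_rare_per_column` and the `demo` frame are hypothetical. It maps each value to its count within its own column and masks the rare ones. The value 7 is rare in column A but common in column B, so only the cell in A is masked. If the goal is to drop the affected rows instead of keeping NaN, `dropna()` can be applied afterwards.

```python
import pandas as pd


def mask_rare_per_column(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Return a copy of df where, within each column, values that occur
    `threshold` times or fewer in that column are replaced with NaN."""
    def mask_one(col: pd.Series) -> pd.Series:
        # Map every cell to the number of times its value occurs in this column.
        counts = col.map(col.value_counts())
        return col.mask(counts <= threshold)

    return df.apply(mask_one)


# Hypothetical data: 7 occurs once in A but three times in B.
demo = pd.DataFrame({"A": [0, 0, 1, 1, 7],
                     "B": [7, 7, 7, 0, 0]})
masked = mask_rare_per_column(demo, threshold=1)

# Optionally drop every row that contains a masked (NaN) cell.
rows_kept = masked.dropna()
print(masked)
print(rows_kept)
```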
