Pandas: Filter dataframe for values that are too frequent or too rare
Problem description
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (say, that occur less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ('city of origin', 'city of destination', 'distance, type of transport (air/car/foot)', 'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, then a transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
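For reference, the transform-based idea described in the question can be sketched as follows. This is a minimal sketch on toy data, with an illustrative 20% threshold; `groupby(col)[col].transform('count')` broadcasts each value's count back onto its rows, so no extra columns need to be kept around:

```python
import pandas as pd

# Toy data: 'foot' and 'night' are deliberately rare
df = pd.DataFrame({'transport': ['air', 'air', 'car', 'car', 'car', 'foot'],
                   'time': ['day', 'day', 'day', 'night', 'day', 'day']})

threshold = 0.20  # illustrative: keep values occurring in >= 20% of rows
mask = pd.Series(True, index=df.index)
for col in df.columns:
    # frequency of each row's value in its column
    freq = df.groupby(col)[col].transform('count') / len(df)
    mask &= freq >= threshold

filtered = df[mask]
```

Building a single boolean mask first (instead of shrinking the frame per column) evaluates every column's frequencies against the original row count, which may or may not be the semantics you want.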
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
- It normalizes the value counts so you can just use a percentile threshold.
- It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
counts = df[col].value_counts(normalize=True)
df = df.loc[df[col].isin(counts[counts > threshold].index), :]
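If the filter should apply only to a list of columns (as the question allows), the same loop can run over an explicit subset; `cols_to_filter` below is a hypothetical name for that list, and the data is synthetic:

```python
import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(list(string.ascii_lowercase[:2]), size=(100, 3)),
                  columns=['X', 'Y', 'Z'])

threshold = 0.03
cols_to_filter = ['X', 'Y']  # hypothetical subset; 'Z' is left untouched
for col in cols_to_filter:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
```

Here both values 'a' and 'b' occur far more often than 3%, so no rows are dropped.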
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (1_000_000, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
counts = df[col].value_counts(normalize=True)
df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
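Note that the loops above only drop rare values, while the question also asks about values that are too frequent. A two-sided variant is a small extension of the same pattern (a sketch with illustrative lower/upper bounds on toy data):

```python
import pandas as pd

# Toy column: 'a' is too frequent (90%), 'c' too rare (2%), 'b' in between
df = pd.DataFrame({'mode': ['a'] * 90 + ['b'] * 8 + ['c'] * 2})

lower, upper = 0.05, 0.80  # illustrative bounds on a value's share
for col in df:
    counts = df[col].value_counts(normalize=True)
    keep = counts[(counts > lower) & (counts < upper)].index
    df = df.loc[df[col].isin(keep), :]
```

Only 'b' falls inside the (5%, 80%) band, so only those rows survive.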