Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items


Question

I am looking for a way to remove rows that contain low-frequency items from a dataframe. I adapted the following snippet from this post:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
         columns = ['A', 'B'])

threshold = 10 # Values that occur this many times or fewer are replaced with NaN.
value_counts = df.stack().value_counts() # Entire DataFrame 
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
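
Note that the snippet above only overwrites rare values with NaN; no row is dropped until something like dropna is called afterwards. A minimal sketch of both steps, assuming a tiny hand-made frame and a threshold of 1 (both made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up data): 1 and 2 are frequent, 7 and 9 occur once.
df = pd.DataFrame({"A": [1, 1, 2, 7], "B": [2, 1, 9, 2]})

threshold = 1  # values occurring this many times or fewer count as rare
value_counts = df.stack().value_counts()
to_remove = value_counts[value_counts <= threshold].index

# Step 1: rare values become NaN (affected columns are upcast to float).
masked = df.replace(list(to_remove), np.nan)

# Step 2: drop every row that now contains at least one NaN.
filtered = masked.dropna()
print(filtered)
```

Only rows 0 and 1 survive here, because rows 2 and 3 each contain a value that occurs once.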

The problem is that this code does not seem to scale.

The line to_remove = value_counts[value_counts <= threshold].index has now been running for several hours on my data (a 2 GB compressed HDFStore), so I need a better solution, ideally an out-of-core one. I suspect dask.dataframe is suitable, but I have not managed to express the code above in terms of dask: the key functions stack and replace are absent from dask.dataframe.
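
One way to read "out-of-core" here is a two-pass scheme: a first pass over chunks accumulates the value counts, and a second pass filters each chunk against the rare values. The sketch below is an illustration under assumptions, not the method from the original post. The helper make_chunks is hypothetical and merely slices an in-memory frame; with real data the chunks might come from a chunked reader such as pd.read_csv(..., chunksize=...). The value range and threshold are also made up.

```python
from collections import Counter

import numpy as np
import pandas as pd


def count_values(chunks):
    """First pass: accumulate value frequencies across all chunks."""
    totals = Counter()
    for chunk in chunks:
        unique, counts = np.unique(chunk.to_numpy().ravel(), return_counts=True)
        totals.update(dict(zip(unique.tolist(), counts.tolist())))
    return totals


def filter_chunks(chunks, to_remove):
    """Second pass: yield each chunk without the rows that contain a rare value."""
    for chunk in chunks:
        keep = ~chunk.isin(to_remove).any(axis=1)
        yield chunk[keep]


def make_chunks(df, chunk_size):
    """Hypothetical stand-in for a chunked reader over data on disk."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


# Made-up data: 200 values drawn from 40 distinct integers.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 40, size=(100, 2)), columns=["A", "B"])
threshold = 3  # values occurring this many times or fewer are rare

totals = count_values(make_chunks(df, chunk_size=25))
to_remove = [value for value, n in totals.items() if n <= threshold]
result = pd.concat(list(filter_chunks(make_chunks(df, chunk_size=25), to_remove)))
print(len(df), "->", len(result))
```

Only the Counter of distinct values and one chunk at a time need to fit in memory, at the cost of reading the data twice.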

To work around the lack of these two functions, I tried the following, which works in plain pandas:

value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(
    i
    for indices in [
        df.loc[df[col].isin(infrequent_items.keys())].index.values
        for col, infrequent_items in zip(df.columns, infrequent_itemss)
    ]
    for i in indices
)
df.drop(rows_to_drop)

That does not actually work with dask, though: it fails at infrequent_items.keys().

Even if it did work, it is the opposite of elegant, so I suspect there must be a better way.
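
For comparison, the per-column attempt above can be written more compactly in plain pandas by building one boolean row mask instead of collecting index labels. This is a sketch with made-up data; like the attempt, it counts values per column rather than over the whole frame, and it does not address the dask error.

```python
import pandas as pd

# Made-up data: 2 is rare in column A, 6 is rare in column B.
df = pd.DataFrame({"A": [1, 1, 1, 2, 3, 3, 3],
                   "B": [5, 5, 5, 5, 5, 6, 5]})

keep = pd.Series(True, index=df.index)
for col in df.columns:
    counts = df[col].value_counts()
    rare = counts[counts < 3].index      # values seen fewer than 3 times in this column
    keep &= ~df[col].isin(rare)          # a row survives only if every column passes

filtered = df[keep]
print(filtered)
```

Rows 3 and 5 are dropped, since they hold the rare values 2 and 6 respectively.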

Can you suggest anything?

Recommended answer

The following code, which incorporates Evan's improvement, solves my issue:

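# Count how often each value occurs anywhere in the frame.
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))

# Values that occur fewer than threshold times are considered rare.
to_remove = {k for k, v in d.items() if v < threshold}

# True wherever a cell holds a rare value.
mask = df.isin(to_remove)

# True for every row that contains no rare value.
row_mask = (~mask).all(axis=1)

df = df[row_mask]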

Demo:

import numpy as np
import pandas as pd


def filter_low_frequency(df, threshold=4):
    # Count how often each value occurs anywhere in the frame.
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    # Values that occur fewer than threshold times are considered rare.
    to_remove = {k for k, v in d.items() if v < threshold}
    mask = df.isin(to_remove)
    # Keep only the rows that contain no rare value.
    row_mask = (~mask).all(axis=1)
    return df[row_mask]


df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)


    0   1   2   3   4   5   6   7   8   9
0   3  17  11  13   8   8  15  14   7   8
1   2  14  11   3  16  10  19  19  14   4
2   8  13  13  17   3  13  17  18   5  18
3   7   8  14   9  15  12   0  15   2  19
4   6  12  13  11  16   6  19  16   2  17
5   2   1   2  17   1   3  12  10   2  16
6   0  19   9   4  15   3   3   3   4   0
7  18   8  15   9   1  18  15  17   9   0
8  17  15   9  11  13   9  11   4  19   8
9  13   6   7   8   8  10   0   3  16  13
8     9
3     8
13    8
17    7
15    7
19    6
2     6
9     6
11    5
16    5
0     5
18    4
4     4
14    4
10    3
12    3
7     3
6     3
1     3
5     1
dtype: int64
    0   1  2   3   4  5   6  7   8  9
6   0  19  9   4  15  3   3  3   4  0
8  17  15  9  11  13  9  11  4  19  8
