Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items
Question
I am looking for a way to remove rows that contain low-frequency items from a dataframe. I adapted the following snippet from this post:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10 # Anything that occurs this many times or fewer will be removed (the filter below uses <=).
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
The problem is that this code does not seem to scale.
The line to_remove = value_counts[value_counts <= threshold].index has now been running for several hours on my data (a 2 GB compressed HDFStore). I therefore need a better solution, ideally an out-of-core one. I suspect dask.dataframe is suitable, but I have not managed to express the above code in terms of dask: the key functions stack and replace are absent from dask.dataframe.
I tried the following (which works in plain pandas) to work around the lack of these two functions:
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(i for indices in [df.loc[df[col].isin(infrequent_items.keys())].index.values for col, infrequent_items in zip(df.columns, infrequent_itemss)] for i in indices)
df.drop(rows_to_drop)
That does not actually work with dask, though. It errors at infrequent_items.keys().
Even if it did work, it is the opposite of elegant, so I suspect there must be a better way.
Can you suggest anything?
Recommended answer
The following code, which incorporates Evan's improvement, solves my issue:
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
to_remove = {k for k, v in d.items() if v < threshold}
mask = df.isin(to_remove)
column_mask = (~mask).all(axis=1)
df = df[column_mask]
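As a sanity check on why this can stand in for the stack().value_counts() approach from the question, the sketch below compares the two counting methods on a tiny frame. The data is made up for illustration and is not from the original post. Note also that the answer filters with a strict < threshold, whereas the question's snippet used <=.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (illustrative data, not from the original post).
df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 2, 2, 4]})

# Approach from the question: reshape to one long Series, then count.
counts_stack = df.stack().value_counts().sort_index()

# Approach from the answer: count on the flattened NumPy array directly.
unique, counts = np.unique(df.values.ravel(), return_counts=True)
counts_unique = pd.Series(counts, index=unique).sort_index()

# Both approaches should produce the same value -> count mapping.
same = counts_stack.to_dict() == counts_unique.to_dict()
print(same)
```

The np.unique version works on the raw array and skips building the stacked Series, which may help on large frames; no timing is measured here.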
Demo:
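import numpy as np
import pandas as pd

def filter_low_frequency(df, threshold=4):
    # Count every value across the whole frame.
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    # Values occurring fewer than `threshold` times.
    to_remove = {k for k, v in d.items() if v < threshold}
    mask = df.isin(to_remove)
    # True for rows that contain no low-frequency value.
    column_mask = (~mask).all(axis=1)
    df = df[column_mask]
    return df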
df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)
0 1 2 3 4 5 6 7 8 9
0 3 17 11 13 8 8 15 14 7 8
1 2 14 11 3 16 10 19 19 14 4
2 8 13 13 17 3 13 17 18 5 18
3 7 8 14 9 15 12 0 15 2 19
4 6 12 13 11 16 6 19 16 2 17
5 2 1 2 17 1 3 12 10 2 16
6 0 19 9 4 15 3 3 3 4 0
7 18 8 15 9 1 18 15 17 9 0
8 17 15 9 11 13 9 11 4 19 8
9 13 6 7 8 8 10 0 3 16 13
8 9
3 8
13 8
17 7
15 7
19 6
2 6
9 6
11 5
16 5
0 5
18 4
4 4
14 4
10 3
12 3
7 3
6 3
1 3
5 1
dtype: int64
0 1 2 3 4 5 6 7 8 9
6 0 19 9 4 15 3 3 3 4 0
8 17 15 9 11 13 9 11 4 19 8
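The question asked for an out-of-core solution, while the code above still holds the whole frame in memory. One possible way to adapt it, offered as a sketch and not as part of the original answer, is to make two passes over the data in chunks: first accumulate counts, then filter each chunk. The function names count_values, filter_chunks and make_chunks are hypothetical. make_chunks only simulates a chunked reader here; in practice the chunks might come from something like pd.read_csv(..., chunksize=...). The sketch assumes the set of distinct values fits in memory even when the rows do not.

```python
from collections import Counter

import numpy as np
import pandas as pd


def count_values(chunks):
    """Pass 1: accumulate value counts across all chunks."""
    totals = Counter()
    for chunk in chunks:
        unique, counts = np.unique(chunk.values.ravel(), return_counts=True)
        # Counter.update with a mapping adds the counts to the running totals.
        totals.update(dict(zip(unique.tolist(), counts.tolist())))
    return totals


def filter_chunks(chunks, to_remove):
    """Pass 2: yield each chunk without rows that contain a rare value."""
    for chunk in chunks:
        keep = ~chunk.isin(to_remove).any(axis=1)
        yield chunk[keep]


def make_chunks(df, chunk_size):
    """Hypothetical stand-in for a real chunked reader."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


threshold = 4
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(1000, 3)), columns=list("ABC"))
df.iloc[10, 0] = 99   # inject a value that occurs only once
df.iloc[500, 2] = 98  # and another one

totals = count_values(make_chunks(df, 100))
to_remove = {value for value, n in totals.items() if n < threshold}
filtered = pd.concat(list(filter_chunks(make_chunks(df, 100), to_remove)))
print(sorted(to_remove), len(df), len(filtered))
```

The data has to be read twice because a value's global count is not known until every chunk has been seen. In a real pipeline each filtered chunk would be written out as it is produced instead of being concatenated in memory.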