快速删除只有一个不同值的数据框列 [英] quickly drop dataframe columns with only one distinct value
问题描述
有没有比下面的代码更快的方法来删除仅包含一个不同值的列?
Is there a faster way to drop columns that only contain one distinct value than the code below?
cols=df.columns.tolist()
for col in cols:
if len(set(df[col].tolist()))<2:
df=df.drop(col, axis=1)
对于大型数据帧,这确实非常慢.从逻辑上讲,这实际上是在达到2个不同的值后才停止计数的情况下,对每列中的值数量进行计数.
This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.
推荐答案
You can use Series.unique()
method to find out all the unique elements in a column, and for columns whose .unique()
returns only 1
element, you can drop that. Example -
for col in df.columns:
if len(df[col].unique()) == 1:
df.drop(col,inplace=True,axis=1)
一种不会就地删除的方法-
A method that does not do inplace dropping -
res = df
for col in df.columns:
if len(df[col].unique()) == 1:
res = res.drop(col,axis=1)
演示-
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
.....: if len(df[col].unique()) == 1:
.....: df.drop(col,inplace=True,axis=1)
.....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
计时结果-
Timing results -
In [166]: %paste
def func1(df):
res = df
for col in df.columns:
if len(df[col].unique()) == 1:
res = res.drop(col,axis=1)
return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
最快的方法似乎仍然是使用unique
并遍历各列的方法.
The fastest method still seems to be the method using unique
and looping through the columns.
这篇关于快速删除只有一个不同值的数据框列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!