Group duplicate column IDs in pandas dataframe


Question


Now there are a lot of similar questions, but most of them answer how to delete the duplicate columns. However, I want to know how I can make a list of tuples where each tuple contains the column names of duplicate columns. I am assuming that each column has a unique name. Just to further illustrate my question:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

then I want the output:

[('A', 'C'), ('B', 'D')]

And if you are feeling great today then also extend the same question to rows. How to get a list of tuples where each tuple contains duplicate rows.

Solution

Here's one NumPy approach -

import numpy as np

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)        # order that sorts the columns lexicographically
    b = a[:,sidx]               # rearrange so duplicate columns are adjacent

    # True where a column equals the one before it; pad with False at both ends
    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])   # start/stop of each run of duplicates
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
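To see what each step is doing, here's a minimal trace on a tiny frame (the frame itself is illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': [1, 2]})
a = df_small.values                       # (2, 3): rows x columns
sidx = np.lexsort(a)                      # order that sorts the columns -> [0, 2, 1]
b = a[:, sidx]                            # duplicate columns are now adjacent
pair_eq = (b[:, 1:] == b[:, :-1]).all(0)  # adjacent-column equality -> [True, False]
m = np.concatenate(([False], pair_eq, [False]))
idx = np.flatnonzero(m[1:] != m[:-1])     # edges of each run of duplicates
C = df_small.columns[sidx].tolist()       # column names in sorted order
groups = [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
print(groups)                             # [['A', 'C']]
```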

Sample runs -

In [100]: df
Out[100]: 
    A  B  C  D  E  F
a1  1  2  1  2  3  1
a2  2  4  2  4  4  1
a3  3  2  3  2  2  1
a4  4  1  4  1  1  1
a5  5  9  5  9  2  1

In [101]: group_duplicate_cols(df)
Out[101]: [['A', 'C'], ['B', 'D']]

# Let's add one more duplicate into group containing 'A'
In [102]: df.F = df.A

In [103]: group_duplicate_cols(df)
Out[103]: [['A', 'C', 'F'], ['B', 'D']]
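The question asked for tuples rather than lists; if that exact shape matters, each group can simply be wrapped in tuple(). A self-contained sketch (repeating the function so it runs on its own):

```python
import numpy as np
import pandas as pd

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)
    b = a[:, sidx]
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])

# Wrap each group in tuple() to match the format the question asked for
pairs = [tuple(g) for g in group_duplicate_cols(df)]
print(pairs)   # [('A', 'C'), ('B', 'D')]
```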

To do the same for rows (the index), we just need to switch the operations to the other axis, like so -

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)      # order that sorts the rows lexicographically
    b = a[sidx]                 # rearrange so duplicate rows are adjacent

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]

Sample run -

In [260]: df2
Out[260]: 
   a1  a2  a3  a4  a5
A   3   5   3   4   5
B   1   1   1   1   1
C   3   5   3   4   5
D   2   9   2   1   9
E   2   2   2   1   2
F   1   1   1   1   1

In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
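Since the rows of df.T are the columns of df, the two functions should agree on transposed input. A quick sanity check (both definitions repeated so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)
    b = a[:, sidx]
    m = np.concatenate(([False], (b[:, 1:] == b[:, :-1]).all(0), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 2], 'C': [1, 2, 3]})
# Grouping duplicate columns of df == grouping duplicate rows of df.T
assert group_duplicate_cols(df) == group_duplicate_rows(df.T)
```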


Benchmarking

Approaches -

# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
    return[x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]                        

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))

Note: @John Galt's soln-2 wasn't included because, with inputs of size (8000, 500), its proposed broadcasting would blow up memory.

Timings -

In [179]: # Setup inputs with sizes as mentioned in the question
     ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
     ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
     ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
     ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
     ...: 

# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop

# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop

# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop

# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop

# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop
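As a non-benchmarked aside, a plain dictionary keyed on each column's values also gives exact groups and is easy to read, though it isn't vectorized (a sketch, not one of the timed solutions above):

```python
import pandas as pd

def group_duplicate_cols_dict(df):
    # Bucket column labels by the tuple of that column's values;
    # identical columns land in the same bucket.
    buckets = {}
    for col in df.columns:
        buckets.setdefault(tuple(df[col]), []).append(col)
    return [g for g in buckets.values() if len(g) > 1]

df = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': [1, 2]})
print(group_duplicate_cols_dict(df))   # [['A', 'C']]
```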


Super boost with NumPy's view functionality

Leveraging NumPy's view functionality, which lets us view each group of elements as one dtype, we can gain a further noticeable performance boost, like so -

import numpy as np

def view1D(a): # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
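The void view packs each row's bytes into a single opaque element, so whole-row comparisons become scalar comparisons; a tiny demonstration (values are illustrative):

```python
import numpy as np

def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

a = np.array([[1, 2], [3, 4], [1, 2]])
v = view1D(a)          # one void element per row
print(v.shape)         # (3,)
print(v == v[0])       # rows 0 and 2 compare equal byte-wise, row 1 does not
```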

def group_duplicate_cols_v2(df):
    a = df.values
    sidx = view1D(a.T).argsort()
    b = a[:,sidx]

    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]

Timings -

In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop

In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop

Just crazy speedups!

