Removing Pandas duplicate values within rows, replace with NaNs, shifting NaNs to the end of rows


      Problem:

      How to remove duplicate cell values from each row, considering each row separately (and perhaps replace them with NaNs) in a Pandas dataframe?

      It would be even better if we could shift all newly created NaNs to the end of each row.


      References: related but different posts (e.g. on dropping entire rows that are considered duplicates) address a different problem.


      Example:

      import pandas as pd
      df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                         'b': ['B', 'D', 'B', 'B'],
                         'c': ['C', 'C', 'C', 'A'],
                         'd': ['D', 'D', 'B', 'A']},
                         index=[0, 1, 2, 3])
      

      which creates this df:

         a  b  c  d
      0  A  B  C  D
      1  A  D  C  D
      2  C  B  C  B
      3  B  B  A  A



      One solution:

      One way of dropping duplicates from each row, considering each row separately:

      df = df.apply(lambda row: pd.Series(row).drop_duplicates(keep='first'), axis='columns')
      

      using apply(), a lambda function, pd.Series(), & Series.drop_duplicates().
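To see why a second NaN-shifting step is needed, note that `drop_duplicates` keeps the original index labels, so `apply` realigns the surviving values against the full column set and leaves NaN exactly where each duplicate was dropped. A quick check with the example `df` above (a sketch; the exact printed alignment may vary):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']})

# Drop duplicates within each row; each dropped cell becomes NaN in place,
# e.g. row 3 ('B', 'B', 'A', 'A') keeps 'B' at 'a' and 'A' at 'c',
# leaving NaN at 'b' and 'd' -- not shifted to the end.
out = df.apply(lambda row: row.drop_duplicates(keep='first'), axis='columns')
print(out)
```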

      Shove all NaNs to the end of each row, using Shift NaNs to the end of their respective rows:

      df.apply(lambda x: pd.Series(x[x.notnull()].values.tolist() + x[x.isnull()].values.tolist()), axis='columns')
      

      Output (as desired):

         0  1    2    3
      0  A  B    C    D
      1  A  D    C  NaN
      2  C  B  NaN  NaN
      3  B  A  NaN  NaN


      Question: Is there a more efficient way to do this? Perhaps with some built-in Pandas functions?

      Solution

      You can stack and then drop_duplicates. Then pivot back with the help of a cumcount column. The stack preserves the order in which the values appear along each row, and the cumcount ensures that the NaNs appear at the end.

      df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
      
      df1['col'] = df1.groupby('level_0').cumcount()
      df1 = (df1.pivot(index='level_0', columns='col', values=0)
                .rename_axis(index=None, columns=None))
      
         0  1    2    3
      0  A  B    C    D
      1  A  D    C  NaN
      2  C  B  NaN  NaN
      3  B  A  NaN  NaN
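To make the role of `cumcount` concrete, here is the intermediate long-format frame from the same example `df` (a sketch of the state just before the pivot): after `stack` + `drop_duplicates`, each surviving value keeps its row label in `level_0`, and `cumcount` numbers the survivors 0, 1, 2, … within each row — those numbers become the new column positions.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']})

# Long format: one (row label, value) pair per cell; dropping 'level_1'
# first means drop_duplicates de-duplicates per row, not per column.
df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()

# Number the survivors within each original row (0, 1, 2, ...);
# these become the column labels after the pivot.
df1['col'] = df1.groupby('level_0').cumcount()
print(df1)
```

Row 0 keeps all four values (col 0–3), while row 2 keeps only 'C' and 'B' (col 0–1), which is why NaN fills the trailing columns after the pivot.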
      


      Timings

      Assuming 4 columns, let's see how a bunch of these methods compare as the number of rows grow. The map and apply solutions have a good advantage when things are small, but they become a bit slower than the more involved stack + drop_duplicates + pivot solution as the DataFrame gets longer. Regardless, they all start to take a while for a large DataFrame.

      import perfplot
      import pandas as pd
      import numpy as np
      
      def stack(df):
          df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
      
          df1['col'] = df1.groupby('level_0').cumcount()
          df1 = (df1.pivot(index='level_0', columns='col', values=0)
                    .rename_axis(index=None, columns=None))
          return df1
      
      def apply_drop_dup(df):
          return pd.DataFrame.from_dict(df.apply(lambda x: x.drop_duplicates().tolist(),
                                                 axis=1).to_dict(), orient='index')
      
      def apply_unique(df):
          return pd.DataFrame(df.apply(pd.Series.unique, axis=1).tolist())
      
      
      def list_map(df):
          return pd.DataFrame(list(map(pd.unique, df.values)))
      
      
      perfplot.show(
          setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                       columns=list('abcd')), 
          kernels=[
              lambda df: stack(df),
              lambda df: apply_drop_dup(df),
              lambda df: apply_unique(df),
              lambda df: list_map(df),
          ],
          labels=['stack', 'apply_drop_dup', 'apply_unique', 'list_map'],
          n_range=[2 ** k for k in range(18)],
          equality_check=lambda x,y: x.compare(y).empty,  
          xlabel='~len(df)'
      )
      


      Finally, if preserving the order in which the values originally appeared within each row is unimportant, you can use numpy. To de-duplicate you sort then check for differences. Then create an output array that shifts values to the right. Because this method will always return 4 columns, we require a dropna to match the other output in the case that every row has fewer than 4 unique values.

      def with_numpy(df):
          arr = np.sort(df.to_numpy(), axis=1)
          r = np.roll(arr, 1, axis=1)
          r[:, 0] = np.nan
          
          arr = np.where((arr != r), arr, np.nan)
          
          # Move all NaN to the right. Credit @Divakar
          mask = pd.notnull(arr)
          justified_mask = np.flip(np.sort(mask, axis=1), 1)
          out = np.full(arr.shape, np.nan, dtype=object) 
          out[justified_mask] = arr[mask]
          
          return pd.DataFrame(out, index=df.index).dropna(how='all', axis='columns')
      
      with_numpy(df)
      #   0  1    2    3
      #0  A  B    C    D
      #1  A  C    D  NaN
      #2  B  C  NaN  NaN     # B/c this method sorts, B before C
      #3  A  B  NaN  NaN
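The justify trick in the middle of `with_numpy` is worth isolating: sorting a boolean mask along each row pushes the `True` entries to the right, flipping puts them on the left, and row-major assignment then fills the surviving values left-justified in their original row order. A minimal standalone sketch with numeric data:

```python
import numpy as np

a = np.array([[1.0, np.nan, 2.0, np.nan],
              [np.nan, 3.0, np.nan, np.nan]])

mask = ~np.isnan(a)                            # True where a value survives
justified = np.flip(np.sort(mask, axis=1), 1)  # same count of True per row, left-justified
out = np.full(a.shape, np.nan)
out[justified] = a[mask]                       # row-major fill preserves within-row order
# rows become [1, 2, NaN, NaN] and [3, NaN, NaN, NaN]
```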
      


      perfplot.show(
          setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                       columns=list('abcd')), 
          kernels=[
              lambda df: stack(df),
              lambda df: with_numpy(df),
          ],
          labels=['stack', 'with_numpy'],
          n_range=[2 ** k for k in range(3, 22)],
          # Lazy check to deal with string/NaN and irrespective of sort order. 
          equality_check=lambda x, y: (np.sort(x.fillna('ZZ').to_numpy(), 1) 
                                       == np.sort(y.fillna('ZZ').to_numpy(), 1)).all(),
          xlabel='len(df)'
      )
      
