Removing Pandas duplicate values within rows, replace with NaNs, shifting NaNs to the end of rows
Problem:
How to remove duplicate cell values from each row, considering each row separately (and perhaps replace them with NaNs) in a Pandas dataframe?
It would be even better if we could shift all newly created NaNs to the end of each row.
References: related but different posts:
- Posts on how to remove entire rows that are deemed duplicates
- Post on how to remove duplicates from a list which is in a Pandas column
- Remove duplicates from rows and columns (cell) in a dataframe, python (that answer returns a series of strings, not a dataframe)
Example:

import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])
which creates this df:

   a  b  c  d
0  A  B  C  D
1  A  D  C  D
2  C  B  C  B
3  B  B  A  A
One solution:
One way of dropping duplicates from each row, considering each row separately:
df = df.apply(lambda row: pd.Series(row).drop_duplicates(keep='first'),axis='columns')
using apply(), a lambda function, pd.Series(), & Series.drop_duplicates().
Then shove all NaNs to the end of each row, using the approach from "Shift NaNs to the end of their respective rows":

df.apply(lambda x: pd.Series(x[x.notnull()].values.tolist()
                             + x[x.isnull()].values.tolist()),
         axis='columns')
Output (as desired):

   0  1    2    3
0  A  B    C    D
1  A  D    C  NaN
2  C  B  NaN  NaN
3  B  A  NaN  NaN
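A note: the two apply steps above can also be collapsed into a single pass, since Series.unique already keeps the first occurrence of each value in row order, and apply pads the shorter result rows with NaN. A minimal sketch on the example df:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']})

# Series.unique keeps first occurrences in order; apply pads short rows with NaN,
# so duplicates are dropped and the NaNs land at the end in one step.
out = df.apply(lambda row: pd.Series(row.unique()), axis='columns')
print(out)
#    0  1    2    3
# 0  A  B    C    D
# 1  A  D    C  NaN
# 2  C  B  NaN  NaN
# 3  B  A  NaN  NaN
```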
Question: Is there a more efficient way to do this? Perhaps with some built-in Pandas functions?
Solution

You can stack and then drop_duplicates that way. Then we need to pivot with the help of a cumcount level. The stack preserves the order in which the values appear along the rows, and the cumcount ensures that the NaN will appear at the end.

df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
df1['col'] = df1.groupby('level_0').cumcount()
df1 = (df1.pivot(index='level_0', columns='col', values=0)
          .rename_axis(index=None, columns=None))

   0  1    2    3
0  A  B    C    D
1  A  D    C  NaN
2  C  B  NaN  NaN
3  B  A  NaN  NaN
Timings
Assuming 4 columns, let's see how a bunch of these methods compare as the number of rows grows. The map and apply solutions have a good advantage when things are small, but they become a bit slower than the more involved stack + drop_duplicates + pivot solution as the DataFrame gets longer. Regardless, they all start to take a while for a large DataFrame.

import perfplot
import pandas as pd
import numpy as np

def stack(df):
    df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()
    df1['col'] = df1.groupby('level_0').cumcount()
    df1 = (df1.pivot(index='level_0', columns='col', values=0)
              .rename_axis(index=None, columns=None))
    return df1

def apply_drop_dup(df):
    return pd.DataFrame.from_dict(
        df.apply(lambda x: x.drop_duplicates().tolist(), axis=1).to_dict(),
        orient='index')

def apply_unique(df):
    return pd.DataFrame(df.apply(pd.Series.unique, axis=1).tolist())

def list_map(df):
    return pd.DataFrame(list(map(pd.unique, df.values)))

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                 columns=list('abcd')),
    kernels=[
        lambda df: stack(df),
        lambda df: apply_drop_dup(df),
        lambda df: apply_unique(df),
        lambda df: list_map(df),
    ],
    labels=['stack', 'apply_drop_dup', 'apply_unique', 'list_map'],
    n_range=[2 ** k for k in range(18)],
    equality_check=lambda x, y: x.compare(y).empty,
    xlabel='~len(df)'
)
Finally, if preserving the order in which the values originally appeared within each row is unimportant, you can use numpy. To de-duplicate, you sort and then check for differences. Then create an output array that shifts values to the right. Because this method will always return 4 columns, we require a dropna to match the other output in the case that every row has fewer than 4 unique values.

def with_numpy(df):
    arr = np.sort(df.to_numpy(), axis=1)
    r = np.roll(arr, 1, axis=1)
    r[:, 0] = np.NaN
    arr = np.where((arr != r), arr, np.NaN)

    # Move all NaN to the right. Credit @Divakar
    mask = pd.notnull(arr)
    justified_mask = np.flip(np.sort(mask, axis=1), 1)
    out = np.full(arr.shape, np.NaN, dtype=object)
    out[justified_mask] = arr[mask]

    return pd.DataFrame(out, index=df.index).dropna(how='all', axis='columns')

with_numpy(df)
#   0  1    2    3
#0  A  B    C    D
#1  A  C    D  NaN
#2  B  C  NaN  NaN   # B/c this method sorts, B before C
#3  A  B  NaN  NaN
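The justification step inside with_numpy is worth isolating: sorting a boolean mask row-wise puts its True values last (False sorts before True), and flipping each row pushes them to the front; that flipped mask then serves as the destination for the surviving values, assigned in row-major order. A standalone sketch of just that step, on a small hypothetical array:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with NaN "holes" left by de-duplication.
arr = np.array([['A', np.nan, 'C', np.nan],
                ['B', 'D', np.nan, 'E']], dtype=object)

mask = pd.notnull(arr)                              # True where a value survives
justified_mask = np.flip(np.sort(mask, axis=1), 1)  # Trues pushed to the left
out = np.full(arr.shape, np.nan, dtype=object)
out[justified_mask] = arr[mask]                     # row-major order is preserved
print(out)
# [['A' 'C' nan nan]
#  ['B' 'D' 'E' nan]]
```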
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                 columns=list('abcd')),
    kernels=[
        lambda df: stack(df),
        lambda df: with_numpy(df),
    ],
    labels=['stack', 'with_numpy'],
    n_range=[2 ** k for k in range(3, 22)],
    # Lazy check to deal with string/NaN and irrespective of sort order.
    equality_check=lambda x, y: (np.sort(x.fillna('ZZ').to_numpy(), 1)
                                 == np.sort(y.fillna('ZZ').to_numpy(), 1)).all(),
    xlabel='len(df)'
)