Set differences on columns between dataframes
Question
Note: This question is inspired by the ideas discussed in another post: DataFrame algebra in Pandas
Say I have two dataframes A and B, and that for some column col_name their values are:
A[col_name] | B[col_name]
--------------| ------------
1 | 3
2 | 4
3 | 5
4 | 6
I want to compute the set difference between A and B based on col_name. The result of this operation should be:
The rows of A where A[col_name] does not match any entry in B[col_name].
Below is the result for the above example (showing other columns of A as well):
A[col_name] | A[other_column_1] | A[other_column_2]
------------+-------------------|------------------
1 | 'foo' | 'xyz' ....
2 | 'bar' | 'abc'
Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.NaN. I would like to treat those entries as undefined BUT different, i.e. the set difference should return them.
How can I do this in Pandas? (Generalizing to a difference on multiple columns would be great as well.)
Recommended answer
One way is to use the Series isin method:
In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns = ['A', 'B'])
In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns = ['A', 'C'])
Now you can check whether each item in df1['A'] is in df2['A']:
In [13]: df1['A'].isin(df2['A'])
Out[13]:
0 False
1 False
2 True
3 True
Name: A, dtype: bool
In [14]: df1[~df1['A'].isin(df2['A'])] # not in df2['A']
Out[14]:
A B
0 1 foo
1 2 bar
I think this does what you want for NaNs too:
In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns = ['A', 'B'])
In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns = ['A'])
In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
A B
0 1.0 foo
1 NaN bar
3 NaN baz
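Whether isin counts a NaN in df1['A'] as matching a NaN in df2['A'] may depend on the pandas version; as far as I know, recent versions do treat them as a match, which would drop the NaN rows shown above. The following is a minimal sketch, not part of the original answer, that keeps NaN rows explicitly so the result does not rely on that behaviour. The names matched and result are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']],
    columns=['A', 'B'],
)
df2 = pd.DataFrame([[3], [np.nan]], columns=['A'])

# True for rows of df1 whose 'A' value appears in df2['A'].
matched = df1['A'].isin(df2['A'])

# Keep the unmatched rows, plus every row whose 'A' is NaN,
# regardless of whether isin() counted NaN as a match.
result = df1[~matched | df1['A'].isna()]
print(result)
```

This keeps rows 0, 1 and 3 of df1, matching the output shown above.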
Note: For large frames it may be worth making these columns the index (as described in the other question).
One way to merge on two or more columns is to use a dummy column:
In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns = ['A', 'B'])
In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns = ['A', 'B'])
In [33]: cols = ['A', 'B']
In [34]: df2['dummy'] = df2[cols].isnull().any(axis=1) # rows with NaNs in cols will be True
In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')
In [36]: merged
Out[36]:
A B dummy
0 1 foo NaN
1 NaN bar True
2 4 meh False
3 NaN eurgh NaN
Rows with a boolean in dummy were present in df2; True means the matched row has a NaN in one of the merging columns. Following your spec, we should drop the rows where dummy is False:
In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
A B
0 1 foo
1 NaN bar
3 NaN eurgh
Inelegant.
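As a possible alternative to the dummy column, here is a sketch (an assumption, not the method used in the original answer) that relies on the indicator parameter of DataFrame.merge. It marks each row of df1 as 'left_only' or 'both', and rows with a NaN in any key column are kept explicitly. The names keep and result are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']],
    columns=['A', 'B'],
)
df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])
cols = ['A', 'B']

# Left-join on the key columns; '_merge' records where each row was found.
# drop_duplicates() stops repeated keys in df2 from duplicating df1 rows.
merged = df1.merge(
    df2[cols].drop_duplicates(), on=cols, how='left', indicator=True
)

# Keep rows with no match in df2, plus rows with a NaN in any key column.
keep = (merged['_merge'] == 'left_only') | merged[cols].isna().any(axis=1)
result = merged.loc[keep, df1.columns]
print(result)
```

For this data the sketch keeps the same three rows as the dummy-column version: (1, foo), (NaN, bar) and (NaN, eurgh).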