Selecting Unique Rows between Two DataFrames in Pandas
Question
I have two data frames, A and B, of unequal dimensions. I would like to create a data frame C that contains ONLY the rows that are unique between A and B. I tried to follow this solution (excluding rows from a pandas dataframe based on column value and not index value) but could not get it to work.
Here is an example:
Suppose this is DF_A:
Star_ID Loc_ID pmRA pmDE Field Jmag Hmag
2M00000032+5737103 4264 0.000000 0.000000 N7789 10.905 10.635
2M00000068+5710233 4264 8.000000 -18.000000 N7789 10.664 10.132
2M00000222+5625359 4264 0.000000 0.000000 N7789 11.982 11.433
2M00000818+5634264 4264 0.000000 0.000000 N7789 12.501 11.892
2M00001242+5524391 4264 0.000000 -4.000000 N7789 12.091 11.482
And this is DF_B:
2M00000032+5737103
2M00000068+5710233
2M00001242+5524391
So the first two Star_IDs and the last one are common to DF_A and DF_B. I would like to create DF_C such that:
DF_C:
Star_ID Loc_ID pmRA pmDE Field Jmag Hmag
2M00000222+5625359 4264 0.000000 0.000000 N7789 11.982 11.433
2M00000818+5634264 4264 0.000000 0.000000 N7789 12.501 11.892
Recommended answer
This worked for me:
In [7]:
df1[~df1.Star_ID.isin(df2.Star_ID)]
Out[7]:
Star_ID Loc_ID pmRA pmDE Field Jmag Hmag
2 2M00000222+5625359 4264 0 0 N7789 11.982 11.433
3 2M00000818+5634264 4264 0 0 N7789 12.501 11.892
[2 rows x 7 columns]
What this does is build a boolean mask: isin marks each row of df1 whose Star_ID value also appears in df2, and the ~ operator negates (NOTs) that mask, so only the rows of df1 whose Star_ID is absent from df2 are kept. The solution you linked to does essentially the same thing, but perhaps the syntax was unclear?
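As a self-contained sketch of the mask approach, the snippet below rebuilds part of the sample data from the question (only three of the seven columns, for brevity) and applies the same ~isin filter. The names df_a, df_b, df_c and in_b are illustrative, and DF_B is assumed to consist of a single Star_ID column.

```python
import pandas as pd

# Sample data from the question, reduced to three columns for brevity
df_a = pd.DataFrame({
    "Star_ID": [
        "2M00000032+5737103",
        "2M00000068+5710233",
        "2M00000222+5625359",
        "2M00000818+5634264",
        "2M00001242+5524391",
    ],
    "Loc_ID": [4264] * 5,
    "Jmag": [10.905, 10.664, 11.982, 12.501, 12.091],
})
# Assumption: DF_B holds only the Star_ID column
df_b = pd.DataFrame({
    "Star_ID": [
        "2M00000032+5737103",
        "2M00000068+5710233",
        "2M00001242+5524391",
    ],
})

# True for each row of df_a whose Star_ID also appears in df_b
in_b = df_a["Star_ID"].isin(df_b["Star_ID"])

# ~ inverts the mask, keeping only the rows of df_a that are absent from df_b
df_c = df_a[~in_b]
print(df_c)
```

The filtered frame keeps the original index labels of df_a; calling reset_index(drop=True) on the result would renumber them from zero if that is preferred.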
Edit
To get both the values that are only in df1 and the values that are only in df2, you could do this:
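# DataFrame.append was removed in pandas 2.0; pd.concat (with pandas imported as pd) does the same job
unique_vals = pd.concat([df1[~df1.Star_ID.isin(df2.Star_ID)], df2[~df2.Star_ID.isin(df1.Star_ID)]], ignore_index=True)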
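An alternative way to collect the rows found on only one side, not part of the original answer, is an outer merge with indicator=True, which adds a _merge column recording where each key came from. This is only a sketch: the tiny frames and the Star_ID values "a" through "d" are made up for illustration.

```python
import pandas as pd

# Made-up frames: "b" and "c" are shared, "a" is only in df1, "d" only in df2
df1 = pd.DataFrame({"Star_ID": ["a", "b", "c"], "Jmag": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"Star_ID": ["b", "c", "d"]})

# An outer merge keeps every key; _merge is "left_only", "right_only" or "both"
merged = df1.merge(df2, on="Star_ID", how="outer", indicator=True)

# Drop the keys present in both frames, leaving the symmetric difference
only_one_side = merged[merged["_merge"] != "both"]
print(only_one_side)
```

Compared with concatenating two isin filters, the _merge column also records which frame each surviving row came from. Columns that exist in only one frame are filled with NaN for rows from the other.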
Further edit
So the problem was that the CSV had leading spaces, which made every value appear unique across the two datasets. To correct this, strip the leading whitespace:
df1.Apogee_ID = df1.Apogee_ID.str.lstrip()
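To illustrate why stray whitespace defeats the comparison, here is a small sketch with made-up frames in which the IDs in df1 carry a leading space. It uses str.strip(), which removes both leading and trailing whitespace, rather than the lstrip() shown above; the names before and after are illustrative.

```python
import pandas as pd

# Made-up data: the IDs in df1 carry a leading space, as in the problematic CSV
df1 = pd.DataFrame({"Apogee_ID": [" 2M00000032+5737103", " 2M00000222+5625359"]})
df2 = pd.DataFrame({"Apogee_ID": ["2M00000032+5737103"]})

# With the spaces in place no ID matches, so every row of df1 looks unique
before = df1[~df1["Apogee_ID"].isin(df2["Apogee_ID"])]

# Strip leading and trailing whitespace, then compare again
df1["Apogee_ID"] = df1["Apogee_ID"].str.strip()
after = df1[~df1["Apogee_ID"].isin(df2["Apogee_ID"])]

print(len(before), len(after))
```

If the spaces only ever follow the delimiter, passing skipinitialspace=True to pd.read_csv when loading the file may avoid the problem at the source.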