如何比较基于列的两个不同大小的数据帧? [英] How to compare two data frames of different size based on a column?
问题描述
我有两个大小不同的数据帧
I have two data frames with different size
df1
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
和
df2
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510753 1669 6 4 0 ... 0 NaN 33.400 73.200
1 1720.535519 1720 7 15 0 ... 0 NaN 28.700 77.200
2 1780.000000 1780 0 0 0 ... 0 NaN 35.000 77.000
3 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
4 1803.665753 1803 9 1 0 ... 0 NaN 30.300 78.800
5 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600.
1.我想根据"YearDeci"列比较df1和df2.并找出常见条目和唯一条目(除常见行之外的其他行).
1.I wanted to compare df1 and df2 based on the column 'YearDeci'. and find out the common entries and unique entries(rows other than common rows).
2.根据"YearDeci"列输出df1中的公共行(相对于df2).
2.output the common rows(with respect to df2) in df1 based on column 'YearDeci'.
3.根据"YearDeci"列在df1中输出唯一行(相对于df2).
3.output the unique rows(with respect to df2) in df1 based on column 'YearDeci'.
* NB:"YearDeci"中的小数位数差异最多可以达到 +/- 0.0001
*NB: Difference in decimal values up to +/-0.0001 in the 'YearDeci' is tolerable
预期输出就像
row_common =
row_common=
YearDeci Year Month Day ... Mb Lat Lon
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
row_unique =
row_unique=
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
推荐答案
首先在每个都有"上比较 df1.YearDeci 和 df2.YearDeci 原则.要执行比较,请使用 np.isclose 函数(假定绝对值)容忍.
First compare df1.YearDeci with df2.YearDeci on the "each with each" principle. To perform comparison use np.isclose function with the assumed absolute tolerance.
结果是一个 boolean 数组:
- 第一个索引- df1 中的索引,
- 第二个索引- df2 中的索引.
- first index - index in df1,
- second index - index in df2.
然后使用 np.argwhere 查找 True 值的索引,即索引 df1 和 df2 中的相关"行的集合,并根据它们创建一个DateFrame.
Then, using np.argwhere, find indices of True values, i.e. indices of "correlated" rows from df1 and df2 and create a DateFrame from them.
执行上述操作的代码为:
The code to perform the above operations is:
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci[:, np.newaxis],
df2.YearDeci[np.newaxis, :], atol=0.0001, rtol=0)),
columns=['ind1', 'ind2'])
然后,在两个DataFrame中都有成对的索引指向相关"行,执行以下合并:
Then, having pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:
result = ind.merge(df1, left_on='ind1', right_index=True)\
.merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])
最后一步是删除两个辅助索引列"( ind1 和 ind2 ):
The final step is to drop both "auxiliary index columns" (ind1 and ind2):
result.drop(columns=['ind1', 'ind2'], inplace=True)
结果(分为2部分)是:
The result (divided into 2 parts) is:
YearDeci_1 Year_1 Month_1 Day_1 Magnitude Lat_1 Lon_1 YearDeci_2 \
0 1720.535519 1720 7 15 6.5 28.37 77.09 1720.535519
Year_2 Month_2 Day_2 Hour Seconds Mb Lat_2 Lon_2
0 1720 7 15 0 0 NaN 28.7 77.2
这篇关于如何比较基于列的两个不同大小的数据帧?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!