如何比较基于列的两个不同大小的数据帧? [英] How to compare two data frames of different size based on a column?

查看:50
本文介绍了如何比较基于列的两个不同大小的数据帧?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个大小不同的数据帧

I have two data frames with different size

df1

     YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00      
2     1720.535519  1720      7   15  ...        6.5  28.37  77.09      
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00      

df2

         YearDeci  Year  Month  Day  Hour  ...  Seconds   Mb     Lat     Lon  
0     1669.510753  1669      6    4     0  ...        0  NaN  33.400  73.200    
1     1720.535519  1720      7   15     0  ...        0  NaN  28.700  77.200    
2     1780.000000  1780      0    0     0  ...        0  NaN  35.000  77.000    
3     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600    
4     1803.665753  1803      9    1     0  ...        0  NaN  30.300  78.800
5     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600.

1.我想根据"YearDeci"列比较df1和df2.并找出常见条目和唯一条目(除常见行之外的其他行).

1.I wanted to compare df1 and df2 based on the column 'YearDeci'. and find out the common entries and unique entries(rows other than common rows).

2.根据"YearDeci"列输出df1中的公共行(相对于df2).

2.output the common rows(with respect to df2) in df1 based on column 'YearDeci'.

3.根据"YearDeci"列在df1中输出唯一行(相对于df2).

3.output the unique rows(with respect to df2) in df1 based on column 'YearDeci'.

* NB:"YearDeci"中的小数位数差异最多可以达到 +/- 0.0001

*NB: Difference in decimal values up to +/-0.0001 in the 'YearDeci' is tolerable

预期输出就像

row_common =

row_common=

      YearDeci     Year   Month  Day ...   Mb     Lat     Lon 
2     1720.535519  1720      7   15  ...  6.5  28.37  77.09

row_unique =

row_unique=

      YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00           
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00 

推荐答案

首先在每个都有"上比较 df1.YearDeci df2.YearDeci 原则.要执行比较,请使用 np.isclose 函数(假定绝对值)容忍.

First compare df1.YearDeci with df2.YearDeci on the "each with each" principle. To perform comparison use np.isclose function with the assumed absolute tolerance.

结果是一个 boolean 数组:

  • 第一个索引- df1 中的索引,
  • 第二个索引- df2 中的索引.
  • first index - index in df1,
  • second index - index in df2.

然后使用 np.argwhere 查找 True 值的索引,即索引 df1 df2 中的相关"行的集合,并根据它们创建一个DateFrame.

Then, using np.argwhere, find indices of True values, i.e. indices of "correlated" rows from df1 and df2 and create a DateFrame from them.

执行上述操作的代码为:

The code to perform the above operations is:

ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci[:, np.newaxis],
    df2.YearDeci[np.newaxis, :], atol=0.0001, rtol=0)),
    columns=['ind1', 'ind2'])

然后,在两个DataFrame中都有成对的索引指向相关"行,执行以下合并:

Then, having pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:

result = ind.merge(df1, left_on='ind1', right_index=True)\
    .merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])

最后一步是删除两个辅助索引列"( ind1 ind2 ):

The final step is to drop both "auxiliary index columns" (ind1 and ind2):

result.drop(columns=['ind1', 'ind2'], inplace=True)

结果(分为2部分)是:

The result (divided into 2 parts) is:

    YearDeci_1  Year_1  Month_1  Day_1  Magnitude  Lat_1  Lon_1   YearDeci_2  \
0  1720.535519    1720        7     15        6.5  28.37  77.09  1720.535519   

   Year_2  Month_2  Day_2  Hour  Seconds  Mb  Lat_2  Lon_2  
0    1720        7     15     0        0 NaN   28.7   77.2  

这篇关于如何比较基于列的两个不同大小的数据帧?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆