Python Pandas: compare two DataFrames along one column and return the row contents of both DataFrames in another DataFrame


Problem description

  1. I am working with two csv files, imported as the DataFrames df1 and df2.
  2. df1 has 50000 rows and df2 has 150000 rows.
  3. I want to compare the 'time' of df2 with that of df1 (iterating through each row), find the difference in time, and return the values of all columns from the matching rows, saving them in df3 (time synchronization).
  4. For example, 35427949712 ('time' in df1) is nearest or equal to 35427949712 ('time' in df2), so I would like to return the contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save them in df3.
  5. For this I used two techniques, shown in the code below.
  6. Code 1 takes a very long time to execute (72 hours), which is not practical since I have a lot of csv files.
  7. Code 2 gives me a "memory error" and the kernel dies.

It would be great to get a more robust solution that takes computation time, memory, and power into account (Intel Core i7-6700HQ, 8 GB RAM).

Here is the sample data:

import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860], 
                    'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
                   'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})

df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705], 
                    'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
                    'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})

df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])

Code 1

for index, row in df1.iterrows():
    min_diff = float('inf')  # smallest time difference found so far
    pos = None
    for indexer, rows in df2.iterrows():
        diff = abs(float(row['time']) - float(rows['time']))
        if diff < min_diff:
            min_diff = diff
            # storing the position of the nearest df2 row
            pos = indexer
    # df1 values come from the current df1 row, df2 values from the nearest df2 row
    df3.loc[index, 'time'] = df1['time'][index]
    df3.loc[index, 'velocity_x'] = df1['velocity_x'][index]
    df3.loc[index, 'yaw'] = df1['yaw'][index]
    df3.loc[index, 'velocity'] = df2['velocity'][pos]
    df3.loc[index, 'yawrate'] = df2['yawrate'][pos]

Code 2

df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]

Recommended answer

You're looking for pandas.merge_asof. It lets you combine two DataFrames on a key, in this case time, without requiring an exact match. You can choose a direction for prioritizing the match, and in this case you clearly want nearest:

A "nearest" search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
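As a rough illustration of how the direction affects the match, here is a minimal sketch on toy data. The frames left and right and their values are hypothetical and already sorted by time; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames, already sorted by 'time'.
left = pd.DataFrame({'time': [10, 20, 30]})
right = pd.DataFrame({'time': [8, 19, 33], 'value': ['a', 'b', 'c']})

matches = {}
for direction in ('backward', 'forward', 'nearest'):
    # backward: last right key <= left key
    # forward:  first right key >= left key
    # nearest:  right key with the smallest absolute distance
    merged = pd.merge_asof(left, right, on='time', direction=direction)
    matches[direction] = merged['value'].tolist()
    print(direction, matches[direction])
# backward ['a', 'b', 'b']
# forward ['b', 'c', 'c']
# nearest ['a', 'b', 'c']
```

Only nearest picks 'c' for time 30, because 33 is closer to 30 than 19 is.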

One caveat: merge_asof requires both DataFrames to be sorted by the key.

import pandas as pd

pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#          time  velocity   yawrate  velocity_x       yaw
#0  35427009860   12.6556 -0.074351     12.5451 -0.078781
#1  35427029728   12.6556 -0.074351     12.5451 -0.078781
#2  35427049705   12.6444 -0.074351     12.5451 -0.078781
#3  35427929709   12.6583 -0.075049     12.5351 -0.079489
#4  35427949712   12.6556 -0.075049     12.5401 -0.079591

Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 which is closest in absolute distance to the time in df2.
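To make the left/right distinction concrete, here is a sketch that puts df1 on the left instead, using the sample data from the question. The extra columns time_df2 and time_diff are names introduced here for illustration only; they keep the matched df2 timestamp so the time difference the question asks about can be computed.

```python
import pandas as pd

df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709, 35427949712, 35428009860],
                    'velocity_x': [12.5451, 12.5401, 12.5351, 12.5401, 12.5251],
                    'yaw': [-0.0787806, -0.0784749, -0.0794889, -0.0795915, -0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860, 35427029728, 35427049705],
                    'velocity': [12.6583, 12.6556, 12.6556, 12.6556, 12.6444],
                    'yawrate': [-0.0750492, -0.0750492, -0.074351, -0.074351, -0.074351]})

# Copy df2's key into a second column (hypothetical name) so it survives the merge.
right = df2.assign(time_df2=df2['time']).sort_values('time')

# df1 is now the left frame: one output row per df1 row,
# each paired with the df2 row whose time is nearest.
by_df1 = pd.merge_asof(df1.sort_values('time'), right, on='time', direction='nearest')
by_df1['time_diff'] = (by_df1['time'] - by_df1['time_df2']).abs()
print(by_df1[['time', 'time_df2', 'time_diff', 'velocity_x', 'velocity']])
```

With df1 on the left, the result keeps df1's five timestamps rather than df2's, so the two orderings answer different questions.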

You also need to be careful if the right DataFrame has duplicated 'on' keys: for exact matches, merge_asof merges only the last sorted row of the right DataFrame into the left one, instead of creating an entry for each exact match. If that's a problem, you can merge the exact keys first to get all of the combinations, and then merge the remainder with merge_asof.
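The two-step approach for duplicated keys could look roughly like the following. This is only a sketch on hypothetical toy data (the frames left and right and the columns l and r are made up for illustration), not the answerer's own code.

```python
import pandas as pd

# Hypothetical toy data: the right frame repeats the key 20.
left = pd.DataFrame({'time': [10, 20, 31], 'l': ['x', 'y', 'z']})
right = pd.DataFrame({'time': [5, 20, 20, 30], 'r': ['p', 'a', 'b', 'c']})

# Step 1: an ordinary inner merge keeps every exact-match combination.
exact = left.merge(right, on='time', how='inner')

# Step 2: left rows with no exact match go through merge_asof.
remainder = left[~left['time'].isin(right['time'])]
approx = pd.merge_asof(remainder.sort_values('time'), right.sort_values('time'),
                       on='time', direction='nearest')

# Combine both parts and restore time order.
result = (pd.concat([exact, approx], ignore_index=True)
            .sort_values('time', kind='stable')
            .reset_index(drop=True))
print(result)

# For comparison: merge_asof alone yields a single row per left row.
plain = pd.merge_asof(left, right, on='time', direction='nearest')
print(len(plain), len(result))  # 3 4
```

Here the key 20 contributes two rows to result, one for each duplicate in right, whereas plain contains only one.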
