Lookup string values in lookup table to populate second dataframe

Question

I have two dataframes, main_df:

  | header_1
0 | value_1
1 | value_2
2 | value_3
3 | value_1

and a lookup dataframe, lookup_df:

  | header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_4 | lookup_value_4

The values in main_df are not unique. The values in lookup_df are unique.

I simply want to populate a new column in main_df with the corresponding lookup_value from lookup_df.

I have tried various approaches, including .merge, .join, .map and .lookup, for example:

main_df = pd.merge(main_df, lookup_df, how='inner', on=['header_1'])

The result I am looking for is:

  | header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_1 | lookup_value_1

Recommended answer

You can use map with a Series created by set_index or, a bit faster, convert that Series to a dict with to_dict:

main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2']
                                                       .to_dict())
print (main_df)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_1
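
For readers who want to run the answer end to end, here is a minimal self-contained sketch. It assumes the two frames look exactly like the tables in the question; the names mapping, mapped and merged are illustrative and do not come from the original answer. It also shows that a left merge (rather than the inner merge tried in the question) gives the same column here.

```python
import pandas as pd

# Frames rebuilt from the tables in the question
main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_4"],
    "header_2": ["lookup_value_1", "lookup_value_2",
                 "lookup_value_3", "lookup_value_4"],
})

# Series indexed by the key column: key -> lookup value
mapping = lookup_df.set_index("header_1")["header_2"]

# map() looks up every row of main_df in that Series; row order is untouched
mapped = main_df.assign(header_2=main_df["header_1"].map(mapping))

# A left merge keeps every row of main_df and adds the matching header_2
merged = main_df.merge(lookup_df, how="left", on="header_1")

print(mapped)
print(merged)
```

With map, any value of header_1 that does not appear in the lookup table ends up as NaN in the new column, so missing keys are easy to spot afterwards with isna().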

Timings:

#[400000 rows x 1 columns]
main_df = pd.concat([main_df]*100000).reset_index(drop=True)

In [139]: %timeit pd.merge(main_df, lookup_df, how='left', on=['header_1'])
10 loops, best of 3: 73.1 ms per loop

In [140]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])
10 loops, best of 3: 35.7 ms per loop

In [141]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'].to_dict())
10 loops, best of 3: 35.1 ms per loop
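
The comparison above was run with IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard timeit module, as below. The frame contents are taken from the question; the names small_df, big_df, candidates and results are illustrative. Absolute numbers depend on the machine and the pandas version, so none are claimed here.

```python
import timeit

import pandas as pd

small_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_4"],
    "header_2": ["lookup_value_1", "lookup_value_2",
                 "lookup_value_3", "lookup_value_4"],
})

# 4 rows repeated 100000 times -> 400000 rows, as in the timings above
big_df = pd.concat([small_df] * 100000).reset_index(drop=True)

# The same three expressions that were timed with %timeit
candidates = {
    "merge": lambda: pd.merge(big_df, lookup_df, how="left", on=["header_1"]),
    "map(Series)": lambda: big_df["header_1"].map(
        lookup_df.set_index("header_1")["header_2"]),
    "map(dict)": lambda: big_df["header_1"].map(
        lookup_df.set_index("header_1")["header_2"].to_dict()),
}

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call
results = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.1f} ms per call")
```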

The values of column header_1 in lookup_df must be unique. If they are not, one possible solution is drop_duplicates:

print (lookup_df)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_4

#keep first value, default parameter keep='first'
lookup_df1 = lookup_df.drop_duplicates(['header_1'])
print (lookup_df1)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3

#keep last value
lookup_df1 = lookup_df.drop_duplicates(['header_1'], keep='last')
print (lookup_df1)
  header_1        header_2
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_4
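
Before dropping rows it can help to check whether the lookup keys really are duplicated, and which rows are affected. The sketch below is one way to do that, assuming the duplicated lookup table shown above; the names has_duplicates, dupes, merge_ok, deduped and result are illustrative. It uses Series.is_unique, Series.duplicated and the validate argument of merge, which raises pandas.errors.MergeError when the right-hand keys are not unique.

```python
import pandas as pd

main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})

# Lookup table with a duplicated key, as in the example above
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_1"],
    "header_2": ["lookup_value_1", "lookup_value_2",
                 "lookup_value_3", "lookup_value_4"],
})

# Quick yes/no check on the key column
has_duplicates = not lookup_df["header_1"].is_unique

# Every row that takes part in a duplicate, for inspection
dupes = lookup_df[lookup_df["header_1"].duplicated(keep=False)]
print(dupes)

# merge can enforce uniqueness of the lookup keys itself
try:
    main_df.merge(lookup_df, how="left", on="header_1", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# Resolve the duplicates explicitly (here: keep the last row), then map
deduped = lookup_df.drop_duplicates("header_1", keep="last")
result = main_df.assign(
    header_2=main_df["header_1"].map(deduped.set_index("header_1")["header_2"])
)
print(result)
```

Whether to keep the first or the last duplicate is a decision about the data, not about pandas, so it is worth looking at dupes before choosing.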
