How to populate columns of a dataframe using a subset of another dataframe?

Question

I have two dataframes like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

  key prop1 prop2
0   A     x     m
1   A     y     n
2   A     z     b
3   B     u     n
4   B     u     b
5   C     y     b
6   C     x     n
7   A     z     n
8   A     z     n
9   C     z     n

  key  prop1  prop2 keep_me
0   A    NaN    NaN   stuff
1   B    NaN    NaN   stuff
2   B    NaN    NaN   stuff
3   C    NaN    NaN   stuff
4   A    NaN    NaN   stuff
5   A    NaN    NaN   stuff

I now want to populate the columns prop1 and prop2 in df2 using the values from df1. For each key, df1 has at least as many rows as df2 (in the example above: 5 times A vs 3 times A, 2 times B vs 2 times B, and 3 times C vs 1 time C). For each key, I want to fill df2 using the first n rows for that key from df1, where n is the number of rows with that key in df2.
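The precondition stated above (df1 has at least as many rows per key as df2) can be checked before filling. The following is a minimal sketch, assuming the two dataframes from the question; the names `need`, `have`, and `enough` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

# Rows per key that df2 needs to have filled.
need = df2['key'].value_counts()

# Rows per key that df1 can supply; keys missing from df1 count as 0.
have = df1['key'].value_counts().reindex(need.index, fill_value=0)

# True only if every key in df2 is covered by enough rows in df1.
enough = bool((have >= need).all())
print(enough)
```

If `enough` were False, some rows of df2 would have no matching row in df1 and would stay NaN after filling.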

So, my expected outcome for df2 would be:

  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff

As key is not unique, I cannot simply build a dictionary and then use .map.
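To illustrate why a plain dictionary does not work here, the sketch below builds one from df1 and maps it onto the keys of df2. It assumes the df1 from the question; `lookup` and `mapped` are illustrative names. Because a dictionary holds one value per key, later rows overwrite earlier ones, so every occurrence of a key receives the same value.

```python
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

# One entry per key: the last row seen for each key wins.
lookup = dict(zip(df1['key'], df1['prop1']))
print(lookup)

# Mapping the keys of df2 gives the same value for every repeated key,
# not the first n distinct rows per key.
mapped = pd.Series(list('ABBCAA')).map(lookup)
print(mapped.tolist())
```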

I was hoping that something along these lines would work:

pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')

But this fails with:

ValueError: Shape of passed values is (5, 22), indices imply (5, 10)

because, I guess, the index contains non-unique values.

How can I get my desired output?

Recommended answer

Because the key values contain duplicates, one possible solution is to create a new counter column g in both DataFrames with GroupBy.cumcount (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumcount.html). The counter numbers the rows within each key, so key and g together identify a row uniquely. The missing values in df2 can then be replaced with DataFrame.fillna, which aligns the two DataFrames on the MultiIndex built from the key and g columns:

df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()

print (df1)
  key prop1 prop2  g
0   A     x     m  0
1   A     y     n  1
2   A     z     b  2
3   B     u     n  0
4   B     u     b  1
5   C     y     b  0
6   C     x     n  1
7   A     z     n  3
8   A     z     n  4
9   C     z     n  2

print (df2)
  key  prop1  prop2 keep_me  g
0   A    NaN    NaN   stuff  0
1   B    NaN    NaN   stuff  0
2   B    NaN    NaN   stuff  1
3   C    NaN    NaN   stuff  0
4   A    NaN    NaN   stuff  1
5   A    NaN    NaN   stuff  2

df = (df2.set_index(['key','g'])
        .fillna(df1.set_index(['key','g']))
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff
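
As an alternative to the fillna approach, the same per-key counter can be used as a merge key. This is only a sketch under the assumptions of the question (prop1 and prop2 in df2 are entirely NaN and can be dropped before merging); it is not part of the original answer, and the names `left`, `right`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

# Drop the empty placeholder columns and number the rows within each key.
left = (df2.drop(columns=['prop1', 'prop2'])
           .assign(g=df2.groupby('key').cumcount()))
right = df1.assign(g=df1.groupby('key').cumcount())

# A left merge keeps the row order of df2; (key, g) is unique in both frames,
# so each row of df2 matches at most one row of df1.
result = (left.merge(right, on=['key', 'g'], how='left')
              .drop(columns='g')
              [['key', 'prop1', 'prop2', 'keep_me']])
print(result)
```

The merge variant does not modify df1 or df2 in place, since the counter is added with assign on copies. Unlike fillna, it would overwrite any existing non-missing values in prop1 and prop2, so it fits only when those columns are empty, as in the question.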
