How to populate columns of a dataframe using a subset of another dataframe?
Question
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'key': list('AAABBCCAAC'),
'prop1': list('xyzuuyxzzz'),
'prop2': list('mnbnbbnnnn')
})
df2 = pd.DataFrame({
'key': list('ABBCAA'),
'prop1': [np.nan] * 6,
'prop2': [np.nan] * 6,
'keep_me': ['stuff'] * 6
})
  key prop1 prop2
0   A     x     m
1   A     y     n
2   A     z     b
3   B     u     n
4   B     u     b
5   C     y     b
6   C     x     n
7   A     z     n
8   A     z     n
9   C     z     n

  key  prop1  prop2 keep_me
0   A    NaN    NaN   stuff
1   B    NaN    NaN   stuff
2   B    NaN    NaN   stuff
3   C    NaN    NaN   stuff
4   A    NaN    NaN   stuff
5   A    NaN    NaN   stuff
I now want to populate the columns prop1 and prop2 in df2 using the values of df1. For each key, df1 has at least as many rows as df2 (in the example above: 5 times A vs 3 times A, 2 times B vs 2 times B, and 3 times C vs 1 time C). For each key, I want to fill df2 using the first n rows for that key from df1, where n is the number of rows with that key in df2.
So, my expected outcome for df2 would be:
  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff
As key is not unique, I cannot simply build a dictionary and then use .map.
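To illustrate the limitation, here is a minimal sketch that uses only the key and prop1 columns from the example above. The names lookup, keys and mapped are introduced just for this illustration. A dictionary can hold one value per key, so the last row for each key wins and .map repeats that single value:

```python
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
})

# A dict keeps one value per key, so later rows overwrite earlier ones.
lookup = dict(zip(df1['key'], df1['prop1']))
print(lookup)  # {'A': 'z', 'B': 'u', 'C': 'z'}

# .map then assigns that single value to every row with the same key,
# instead of the first, second, third ... row per key.
keys = pd.Series(list('ABBCAA'))  # the key column of df2
mapped = keys.map(lookup)
print(mapped.tolist())  # ['z', 'u', 'u', 'z', 'z', 'z']
```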
I was hoping that something along these lines would work:
pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')
but it fails with
ValueError: Shape of passed values is (5, 22), indices imply (5, 10)
because, I guess, the index contains non-unique values.
How can I get my desired output?
Recommended answer
Because the key values contain duplicates, one possible solution is to create a new counter column g in both DataFrames with GroupBy.cumcount (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumcount.html). The missing values in df2 can then be replaced with DataFrame.fillna, which aligns the two DataFrames on the MultiIndex created from the key and g columns:
df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()
print(df1)
  key prop1 prop2  g
0   A     x     m  0
1   A     y     n  1
2   A     z     b  2
3   B     u     n  0
4   B     u     b  1
5   C     y     b  0
6   C     x     n  1
7   A     z     n  3
8   A     z     n  4
9   C     z     n  2
print(df2)
  key  prop1  prop2 keep_me  g
0   A    NaN    NaN   stuff  0
1   B    NaN    NaN   stuff  0
2   B    NaN    NaN   stuff  1
3   C    NaN    NaN   stuff  0
4   A    NaN    NaN   stuff  1
5   A    NaN    NaN   stuff  2
df = (df2.set_index(['key','g'])
.fillna(df1.set_index(['key','g']))
.reset_index(level=1, drop=True)
.reset_index())
print(df)
  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff
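As a possible alternative to fillna, the same per-key counter can also drive a left merge. This is a sketch of a different technique, not the method used in the answer above. It assumes prop1 and prop2 in df2 are entirely empty, so they can be dropped and taken from df1 instead. The names left, right, merged and result are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})
df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

# Number the rows within each key (0, 1, 2, ...) without modifying df1 or df2.
# Assumption: prop1/prop2 in df2 hold no data, so they are dropped here.
left = df2.drop(columns=['prop1', 'prop2']).assign(g=df2.groupby('key').cumcount())
right = df1.assign(g=df1.groupby('key').cumcount())

# A left merge keeps every row of df2 in its original order and picks
# the row of df1 that has the same key and the same position within the key.
merged = left.merge(right, on=['key', 'g'], how='left')

# Drop the helper counter and restore the original column order.
result = merged[['key', 'prop1', 'prop2', 'keep_me']]
print(result)
```

Unlike fillna, this variant would overwrite any values already present in prop1 or prop2 of df2, so it only fits when those columns are placeholders.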