如何联接具有相同键和互补值的两行 [英] How to join two rows that have the same keys and complementary values

查看:95
本文介绍了如何联接具有相同键和互补值的两行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的目标是将下表折叠成一个单独的列,这个问题专门针对下面的蓝色行.该表具有三个类别变量和6个分析/定量变量.列C1和C2是成功连接所需的唯一变量.所有空白单元格均为NaN,下面是用于复制的python代码.这些行是独立导出的,因为它们具有在其他相关表中找到的信息,但未包含在导出中.

My goal is to collapse the below table into one single column and this question deals specifically with the blue row below. The table has three categorical variables and 6 analysis/quantitative variables. Columns C1 and C2 are the only variables that need to match for a successful join. All blank cells are NaNs and python code for copying is below. These rows are exported independently because they have information found in other related tables, not included in the export.

问题. (蓝色)蓝色行中的某些定量信息也位于灰色行中,而某些则不行.有没有办法将新信息(问题6中的-8)复制到灰色行,然后删除/突出显示蓝色行?假设drop_duplicates的keep ='first'默认值处于活动状态,则此处将保留灰色行分类信息.

Question. (Blue) Some of the quantitative information in the blue row is also in the grey row and some is not. Is there a way to copy the new information (-8 in Q6) into the grey row and then delete/highlight the blue row? Here, the grey row categorical information is maintained, assuming the keep='first' default of drop_duplicates is active.

相关问题.(黄色行)

期望的输出将使蓝色行中的灰色行更新为Q6,并删除蓝色行.

The expected output would have the grey row updated with Q6 from the blue row and the blue row removed.

[['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,-8]]


当前进度

我当前的代码包括此行,以删除所有定量变量均为NaN的所有行.
df.dropna(subset=df.columns[4:],how='all', inplace=True)


Current Progress

My current code includes this line to drop all rows where all quantitative variables are NaN.
df.dropna(subset=df.columns[4:],how='all', inplace=True)

此外,此行用于删除所有定量变量都相同的所有行.
df.drop_duplicates(subset=df.columns[4:], inplace=True)

Also, this line for deleting all rows where all quantitative variables are the same.
df.drop_duplicates(subset=df.columns[4:], inplace=True)

import pandas as pd

df = [['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,nan],
 ['S1','P3','H1',Timestamp('2004-12-20 00:00:00'),nan,nan,nan,nan,nan,nan],
 ['S1','P3','H2',Timestamp('2004-12-20 00:00:00'),-15.0,nan,nan,-10.0,nan,nan],
 ['S1','P3','H3',Timestamp('2004-12-07 00:00:00'),nan,nan,nan,nan,-15.0,-8.0],
 ['S1','P3','H1', Timestamp('2004-12-04 00:00:00'), -15.0,-27.4,nan,-10.0, -15.0, nan]]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
pd.DataFrame(data=df,columns=cols)

df.drop_duplicates(inplace=True)
df.dropna(subset=df.columns[4:],how='all', inplace=True)
df.drop_duplicates(subset=df.columns[4:], inplace=True)

推荐答案

分类列的拆分:

df_categorical = df[['C1 (PK)', 'C2 (FK)',"C3", "C4"]]

在前2列上执行分组,并选择要保留的第一个元素:

Perform a groupby on first 2 columns and select first element to keep:

df_categorical = df_categorical.groupby(["C1 (PK)", "C2 (FK)"]).first()

对于定量列,再次使用groupby并使用均值:

For the quantitative columns use groupby again and use mean this time:

df_quantitative = df.groupby(['C1 (PK)', 'C2 (FK)']).mean()

合并两个数据框以获得结果

merge the two dataframes to get the result

df_final = pd.concat([df_quantitative, df_categorical], axis=1)

重置索引

df_final.reset_index(inplace=True)

这篇关于如何联接具有相同键和互补值的两行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆