Merging a varying number of rows by multiple conditions in Python


Problem description

Problem: merging a varying number of rows by multiple conditions

Here is a stylized example of what the dataset looks like:

"index" "connector" "type" "q_text" "a_text" "varx" ...
   1        1111      1      aa       NA       xx
   2        9999      2      NA       tt       NA
   3        1111      2      NA       uu       NA
   4        9999      1      bb       NA       yy
   5        9999      1      cc       NA       zz

Goal: what the dataset should look like

"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1" "varx" "varx.1" ...
   1        1111      1       2        aa        NA        NA        uu        xx      NA
   2        9999      1       2        bb        NA        NA        tt        yy      NA
   3        9999      1       2        cc        NA        NA        tt        zz      NA

Logic: column "type" has the value 1 or 2. For a given value of "connector", multiple rows can have "type"=1, but only one row has "type"=2.

If rows share the same value in "connector", merge the "type"=2 row with the "type"=1 rows. Because multiple "type"=1 rows can share one "connector" value, the corresponding "type"=2 row has to be duplicated so that it is merged with every "type"=1 row for that connector.

My results: not all rows are merged, because multiple rows of "type"=1 are associated with a single unique row of "type"=2.

Most similar questions are answered using SQL, which I cannot use here.

df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])

Can I achieve my goal with groupby()?

Solution by @victor__von__doom

if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']]  = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
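The lambda in the snippet above relies on zip(*rows) to turn the rows of one connector group into one tuple per column. A minimal sketch of that idiom on its own, using plain lists; the rows variable is a hypothetical stand-in for the values of the 9999 group from the example data, not part of the original solution:

```python
# Hypothetical rows for one connector group: (type, q_text, a_text, varx)
rows = [
    [2, "NA", "tt", "NA"],
    [1, "bb", "NA", "yy"],
    [1, "cc", "NA", "zz"],
]

# zip(*rows) pairs up the i-th element of every row,
# which transposes the rows into per-column tuples.
columns = list(zip(*rows))

print(columns)
# [(2, 1, 1), ('NA', 'bb', 'cc'), ('tt', 'NA', 'NA'), ('NA', 'yy', 'zz')]
```

Each tuple has as many entries as the group has rows, which is why groups of different sizes can sit in the same column.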

Recommended answer

First off, it gets really messy to keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for one connector value and 4 rows for another (for example), the only way to include all values is to add empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:

import numpy as np
import pandas as pd

if __name__ == '__main__':
    data = np.array([[1,2,3,4,5], [1111,9999,1111,9999,9999],
                     [1,2,2,1,1], ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])

    df = pd.DataFrame(data.T, columns = ["index", "connector",
                          "type", "q_text", "a_text", "varx"])

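    # Select the value columns by name rather than by position, so the result
    # does not depend on whether the grouping column is passed to the lambda.
    value_cols = ["type", "q_text", "a_text", "varx"]

    # For each connector, transpose the group's rows into one tuple per column.
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c[value_cols].values.tolist()))).reset_index(name='merged')
    # Expand the list of tuples back into one DataFrame column per original column.
    df[value_cols] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)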

The final DataFrame looks like:

  connector       type        q_text        a_text          varx ...
0      1111     (1, 2)      (aa, NA)      (NA, uu)      (xx, NA) ...
1      9999  (2, 1, 1)  (NA, bb, cc)  (tt, NA, NA)  (NA, yy, zz) ...

which is more compact and readable.
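If the wide layout from the question is still needed, one possible approach is to split the frame by "type" and merge the two halves on "connector". This is only a sketch, not part of the answer above. It assumes each connector has at most one "type"=2 row, as the question states, and the names questions, answers and wide are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "connector": [1111, 9999, 1111, 9999, 9999],
    "type": [1, 2, 2, 1, 1],
    "q_text": ["aa", None, None, "bb", "cc"],
    "a_text": [None, "tt", "uu", None, None],
    "varx": ["xx", None, None, "yy", "zz"],
})

# Split the frame by row type. The "index" column of the type=2 rows is
# dropped so that it does not show up a second time in the result.
questions = df[df["type"] == 1]
answers = df[df["type"] == 2].drop(columns="index")

# A left merge on "connector" repeats the single type=2 row for every
# type=1 row that shares its connector value. validate="many_to_one"
# raises an error if a connector has more than one type=2 row.
wide = questions.merge(
    answers,
    on="connector",
    how="left",
    suffixes=("", ".1"),
    validate="many_to_one",
)

print(wide)
```

The result has one row per "type"=1 row, with the matching "type"=2 values in the columns suffixed ".1". The ".1" columns come after the original columns instead of being interleaved as in the question, and the "index" column keeps the original values rather than being renumbered.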
