Merging a varying number of rows by multiple conditions in Python
Question
Problem: merging a varying number of rows by multiple conditions
Here is a stylized example of what the dataset looks like:
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: what the dataset should look like
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1" "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: Column "type" takes the value 1 or 2. Several rows can have type 1, but only one row with the same "connector" value has type 2.
If rows share a "connector" value, merge the type 2 row with each type 1 row. Because several type 1 rows can share a connector, the corresponding type 2 row has to be duplicated and merged with every type 1 row that has that connector value.
My results: not all rows are merged, because multiple type 1 rows are associated with a single type 2 row.
Most similar questions are answered using SQL, which I cannot use here.
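import pandas as pd

df2 = df.copy()
# astype returns a new Index, so these two calls have no effect unless assigned back
df.index.astype(str)
df2.index.astype(str)
# Attempt 1: on= cannot be combined with left_index/right_index
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
# Attempt 2: self-join with connector as the index
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
# Attempt 3: self-join on the connector column
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])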
Can I achieve my goal with groupby()?
Solution by @victor__von__doom
if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
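A variant of the same idea that avoids the positional slice (`c.values[:,2:]`) and the hand-written column list is to select the value columns by name and aggregate each group into tuples. This is only a sketch: the sample frame and the names `value_cols` and `merged` are assumptions for illustration, and the NA cells are kept as the string "NA".

```python
import pandas as pd

# Hypothetical sample data mirroring the question; "NA" is kept as a plain string.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "connector": [1111, 9999, 1111, 9999, 9999],
    "type": [1, 2, 2, 1, 1],
    "q_text": ["aa", "NA", "NA", "bb", "cc"],
    "a_text": ["NA", "tt", "uu", "NA", "NA"],
    "varx": ["xx", "NA", "NA", "yy", "zz"],
})

# Every column except the row id and the grouping key is a value column.
value_cols = [c for c in df.columns if c not in ("index", "connector")]

# One row per connector; each cell holds a tuple of that column's values in the group.
merged = df.groupby("connector", sort=True)[value_cols].agg(tuple)

print(merged)
```

Because the columns are picked by name, the result does not depend on where "connector" sits in the frame, and `merged.reset_index()` turns the connector back into an ordinary column if needed.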
Recommended answer
First off, it is really messy to keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for one connector value and 4 rows for another (for example), the only way to include all values is to create empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd

if __name__ == '__main__':
    # Sample data; np.array stores the mixed values as strings
    data = np.array([[1,2,3,4,5], [1111,9999,1111,9999,9999],
                     [1,2,2,1,1], ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])
    df = pd.DataFrame(data.T, columns = ["index", "connector",
                                         "type", "q_text", "a_text", "varx"])
    # Per connector, zip the rows (columns from "type" onward) into one tuple per column
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    # Spread the list of tuples back into one column per original field
    df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
The final df:
connector type q_text a_text varx ...
0 1111 (1, 2) (aa, NA) (NA, uu) (xx, NA) ...
1 9999 (2, 1, 1) (NA, bb, cc) (tt, NA, NA) (NA, yy, zz) ...
which is more compact and readable.
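For comparison, if the wide layout from the question is still needed, one possible sketch is to split the frame by type and left-merge the type 1 rows with the type 2 rows on connector, so the single type 2 row is repeated for every matching type 1 row. The sample frame, the names `questions`, `answers` and `wide`, and the use of NaN for the NA cells are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question; NA cells are represented as NaN.
df = pd.DataFrame({
    "connector": [1111, 9999, 1111, 9999, 9999],
    "type": [1, 2, 2, 1, 1],
    "q_text": ["aa", np.nan, np.nan, "bb", "cc"],
    "a_text": [np.nan, "tt", "uu", np.nan, np.nan],
    "varx": ["xx", np.nan, np.nan, "yy", "zz"],
})

questions = df[df["type"] == 1]  # possibly many rows per connector
answers = df[df["type"] == 2]    # assumed: at most one row per connector

# Left merge keeps every type 1 row and repeats the matching type 2 row.
# validate="many_to_one" raises if a connector has more than one type 2 row.
wide = questions.merge(
    answers,
    on="connector",
    how="left",
    suffixes=("", ".1"),
    validate="many_to_one",
)

print(wide)
```

The type 2 columns come out with the `.1` suffix (`type.1`, `q_text.1`, `a_text.1`, `varx.1`), although the column order differs from the interleaved order in the question's target table.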