Form groups of individuals in Python (pandas)


Problem description

I have a data set of the following form:

import pandas as pd

d1 = {'Subject': ['Subject1','Subject1','Subject1','Subject2','Subject2','Subject2','Subject3','Subject3','Subject3','Subject4','Subject4','Subject4'],
      'Event': ['1','2','3','1','2','3','1','2','3','1','2','3'],
      'Category': ['1','1','2','2','1','2','2','','2','1','1',''],
      'Variable1': ['1','2','3','4','5','6','7','8','9','10','11','12'],
      'Variable2': ['12','11','10','9','8','7','6','5','4','3','2','1'],
      'Variable3': ['-6','-5','-4','-3','-4','-3','-2','-1','0','1','2','3']}
d1 = pd.DataFrame(d1)
d1 = d1[['Subject','Event','Category','Variable1','Variable2','Variable3']]
d1

This looks as follows:

Where
1) 'Subject' is the subject-level identifier.
2) 'Event' is the event-level identifier.
3) 'Category' is the category-level identifier.
4) Variable1, Variable2 and Variable3 are some continuous variables for each subject.

I need to form all feasible groups of 2 subjects within each 'Category' for each 'Event'.

For instance, for Event 1, the only possible pairs are:
1) Subject1 - Subject4 (for Category 1)
2) Subject2 - Subject3 (for Category 2)
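
To make the pairing rule concrete, here is a minimal sketch for Event 1 only. The `event1` dictionary is a hand-written, hypothetical restatement of the sample data (it is not part of the original question), and `itertools.combinations` is just one way to enumerate the pairs:

```python
from itertools import combinations

# Event 1 participants from the sample data, keyed by category
# (hypothetical helper structure, written by hand for illustration).
event1 = {
    '1': ['Subject1', 'Subject4'],
    '2': ['Subject2', 'Subject3'],
}

# combinations() yields each unordered pair once and keeps the input order,
# so the first subject listed always ends up on the left of the pair.
pairs = {cat: list(combinations(subjects, 2)) for cat, subjects in event1.items()}
print(pairs)
```

Because `combinations` keeps the order of its input, the order in which subjects appear in the data determines which one ends up first in each pair.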

Note that a missing category value means the subject is considered not to have taken part in that event.

After forming each possible group, I have to take Variable1, Variable2 and Variable3 for both subjects in the pair and put them side by side.

This should look like the following:

It is important to maintain the order in which the subjects appear under the Match1 and Match2 columns, as well as the ordering of the Variable1, Variable2 and Variable3 columns.

The possible pairings for Event 2 are shown below:

Note that since Category is blank for Subject3, she does not appear in the pairings.

Similarly, the possible pairings for Event 3 are shown below. Since Category is blank for Subject4, she does not appear in the pairings.

The final table looks like this:

Note that all numbers are random. In the actual dataset, I have about 15 categories each with about 1000 subjects spanning across 300 events. In some cases, some categories may have no observations for an event just as shown here.
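
As a rough sense of scale for the real dataset, the number of output rows can be estimated with a short calculation. This is only a back-of-the-envelope sketch: it assumes every one of the 15 categories has about 1000 participating subjects in every one of the 300 events, which is an upper bound given that some categories have no observations for some events.

```python
from math import comb

# Figures taken from the question; treating them as exact is an assumption.
subjects_per_category = 1000
categories = 15
events = 300

# Number of unordered pairs within one (event, category) group
pairs_per_group = comb(subjects_per_category, 2)

# Upper bound on the number of rows in the final table
total_pairs = pairs_per_group * categories * events
print(pairs_per_group, total_pairs)
```

Under these assumptions the final table would have on the order of two billion rows, so it may be worth processing one event or one category at a time rather than building everything in memory at once.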

Please let me know if my question is not clear or if I made a mistake in the pair examples here.

Any help will be appreciated. Thanks in advance.

Solution

Use:

from itertools import combinations

# Turn empty categories into NaN so that groupby drops those rows
d1['Category'] = d1['Category'].mask(d1['Category'] == '')

# All pairs of subjects within each (Event, Category) group
L = [(i[0], i[1], y[0], y[1]) for i, x in d1.groupby(['Event','Category'])['Subject']
                              for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['Event','Category','Match1','Match2'])

# Variables of the first subject in each pair
df1 = (df.rename(columns={'Match1':'Subject'})
         .merge(d1, on=['Event','Category','Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.1'))
# Variables of the second subject in each pair
df2 = (df.rename(columns={'Match2':'Subject'})
         .merge(d1, on=['Event','Category','Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.2'))

# Put the pairs and both sets of variables side by side
fin = pd.concat([df, df1, df2], axis=1)


print(fin)
  Event Category    Match1    Match2 Variable1.1 Variable2.1 Variable3.1  \
0     1        1  Subject1  Subject4           1          12          -6   
1     1        2  Subject2  Subject3           4           9          -3   
2     2        1  Subject1  Subject2           2          11          -5   
3     2        1  Subject1  Subject4           2          11          -5   
4     2        1  Subject2  Subject4           5           8          -4   
5     3        2  Subject1  Subject2           3          10          -4   
6     3        2  Subject1  Subject3           3          10          -4   
7     3        2  Subject2  Subject3           6           7          -3   

  Variable1.2 Variable2.2 Variable3.2  
0          10           3           1  
1           7           6          -2  
2           5           8          -4  
3          11           2           2  
4          11           2           2  
5           6           7          -3  
6           9           4           0  
7           9           4           0  

Explanation:

  1. Replace empty strings with NaN using mask; groupby silently drops rows whose group key is NaN.
  2. Build a DataFrame from a list comprehension that flattens all length-2 combinations of the Subject column within each group of Event and Category.
  3. Join the variable columns twice with merge (left join), drop the first 4 columns by position with iloc, and use add_suffix (or add_prefix) to avoid duplicate column names.
  4. Finally, concat all 3 DataFrames together.
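
Step 1 is the least obvious part, so here is a minimal sketch of it in isolation. The small frame `s` is a hypothetical cut-down version of the Event 2 rows from the sample data, and the sketch relies on the default `groupby` behaviour of dropping NaN keys:

```python
import pandas as pd

# Event 2 rows only; Subject3 has an empty Category (did not take part).
s = pd.DataFrame({
    'Event':    ['2', '2', '2', '2'],
    'Category': ['1', '1', '', '1'],
    'Subject':  ['Subject1', 'Subject2', 'Subject3', 'Subject4'],
})

# mask() replaces values with NaN wherever the condition is True
s['Category'] = s['Category'].mask(s['Category'] == '')

# Rows whose group key is NaN are left out of every group by default
groups = {key: list(g) for key, g in s.groupby(['Event', 'Category'])['Subject']}
print(groups)
```

Subject3 no longer appears in any group, which is why she is absent from the Event 2 pairings without any explicit filtering.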
