Split Pandas Dataframe into separate pieces based on column values


Question

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:

import pandas as pd
import numpy as np

columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)

df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52

df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3

df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']

Here is the dataframe:

   s_id c_id c_col1
0   144    1      H
1   144    1      C
2   144    1      N
3   144    1      O <--
4   144    1      S
5   144    2      C
6   144    2      O <--
7   144    2      S
8   144    2      K
9   144    2     Ca
10  144    3      H
11  144    3      O <--
12  144    3      F
13  144    3     Ne
14  144    3     Si
15  105    1      C
16  105    1      O
17  105    1      F
18  105    1     Zn
19  105    2      N
20  105    2      O
21  105    2      F
22  105    2     Fe
23  105    2     Zn
24  105    2     Gd
25  105    2     Hg
26  105    2     Pb
27   52    1      H
28   52    1     He
29   52    1     Li
30   52    1      B
31   52    1      N
32   52    1     Al
33   52    1     Si
34   52    2      N
35   52    2      F
36   52    2     Ne
37   52    2     Na
38   52    2      P
39   52    3      C
40   52    3      N
41   52    3      O
42   52    3      F
43   52    3      K
44   52    3     Ca
45   52    3     Fe

I need to do the following in Pandas:

  1. In a given s_id, produce separate dataframes for each c_id value, e.g. for s_id = 144 there will be 3 dataframes, while for s_id = 105 there will be 2 dataframes.
  2. Inner Join the separate dataframes produced in step 1 on the elements column (c_col1) in Pandas. This is a little difficult to understand, so here is the dataframe that I would like to get from this step:

    index s_id c_id c_col1

    0   144    1      O
    1   144    2      O
    2   144    3      O
    3   105    1      O
    4   105    2      F
    5    52    1      N
    6    52    2      N
    7    52    3      N
    

    As you can see, what I am looking for in part 2.) is the following: within each s_id, I am looking for those c_col1 values that occur for all the c_id values. E.g. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2, 3. I have pointed to these entries with "<--" in the raw data. So, I would like the dataframe to show O 3 times in the c_col1 column, with the corresponding c_id entries being 1, 2, 3. A rough sketch of these joins for s_id = 144 follows below.
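
To illustrate, here is roughly how I picture those inner joins for s_id = 144 (just a sketch; the names df_144_1 etc. are made up for illustration, and it hard-codes the c_id values, which is exactly what I want to avoid):

df_144 = df[df.s_id == 144]
df_144_1 = df_144[df_144.c_id == 1]
df_144_2 = df_144[df_144.c_id == 2]
df_144_3 = df_144[df_144.c_id == 3]
# merge defaults to an inner join; chaining the joins on c_col1 keeps only elements present for every c_id
common = df_144_1.merge(df_144_2, on='c_col1').merge(df_144_3, on='c_col1')
# only 'O' survives all three joins for s_id = 144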

Conditions:

  1. The number of unique c_ids is not known ahead of time, i.e. for one particular s_id I do not know whether 1, 2 and 3 will appear or just 1 and 2. This means that if 1, 2 and 3 occur there will be two Inner Joins, while if only 1 and 2 occur there will be only one Inner Join.

How can this be done with Pandas?

Solution

Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id and the inner keys are the c_id and the inner values are the data. That you can do with a fairly long but straightforward dict comprehension:

DF_dict = {s_id : 
          {c_id : df[(df.s_id == s_id) & (df.c_id == c_id)] for c_id in df[df.s_id == s_id]['c_id'].unique()} 
          for s_id in df.s_id.unique()}

Then for example:

In [12]: DF_dict[52][2]
Out[12]:
   s_id c_id c_col1
34   52    2      N
35   52    2      F
36   52    2     Ne
37   52    2     Na
38   52    2      P

I do not understand part two of your question. Do you then want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question...
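
Just to sketch what I mean by groupby options (one possible shape, not necessarily the better option I have in mind), a single groupby over both keys produces the same pieces as the nested comprehension above:

# a flat dict keyed by (s_id, c_id) tuples instead of a nested dict
pieces = {key: grp for key, grp in df.groupby(['s_id', 'c_id'])}
# e.g. pieces[(52, 2)] is the same slice as DF_dict[52][2] above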

##################EDIT

It seems to me that you should just go straight to problem 2, if problem 1 is simply a step you believe to be necessary to get to a problem 2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform the data according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that occur for every value of c_id in that group.

You might write a function like this:

def c_id_overlap(df):
    common_vals = [] # container for values of c_col1 that are in every c_id subgroup
    c_ids = df.c_id.unique() #get unique values of c_id
    c_col1_values = set(df.c_col1) # get a set of c_col1 values
    #create nested list of values. Each inner list contains the c_col1 values for each c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    #Iterate through the c_col1_values and see if they are in every nested list
    for val in c_col1_values:
        if all([True if val in elem else False for elem in nested_c_col_vals]):
            common_vals.append(val)
    #return a slice of the dataframe that only contains values of c_col1 that are in every
    #c_id
    return df[df.c_col1.isin(common_vals)]

and then pass it to apply on data grouped by s_id:

df.groupby('s_id', as_index = False).apply(c_id_overlap)

which gives me the following output:

     s_id c_id c_col1
0 31   52    1      N
  34   52    2      N
  40   52    3      N
1 16  105    1      O
  17  105    1      F
  18  105    1     Zn
  20  105    2      O
  21  105    2      F
  23  105    2     Zn
2 3   144    1      O
  6   144    2      O
  11  144    3      O

Which seems to be what you are looking for.
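
As an aside, the same rule can also be written without an explicit loop. This is only a rough sketch, and it assumes a pandas version where transform('nunique') is available (it may not be in very old versions):

# keep a row when its (s_id, c_col1) pair spans as many distinct c_id values
# as exist in its whole s_id group
cids_per_sid = df.groupby('s_id')['c_id'].transform('nunique')
cids_per_pair = df.groupby(['s_id', 'c_col1'])['c_id'].transform('nunique')
overlap = df[cids_per_pair == cids_per_sid]
# overlap holds the same rows as the groupby/apply result above, just without the extra group level in the index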

###########EDIT: Additional Explanation:

So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.

So think about the group where s_id == 105. The first line of the function creates an empty list, common_vals, which will contain those elements (c_col1 values) that appear in every subgroup of the data (i.e. for each of the values of c_id).

The second line gets the unique values of 'c_id', in this case [1, 2], and stores them in an array called c_ids.

The third line creates a set of the values of c_col1, which in this case produces:

 {'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}

The fourth line creates a nested list structure, nested_c_col_vals, where every inner list holds the unique c_col1 values associated with one of the elements in the c_ids array. In this case it looks like this:

[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]

Now each of the elements in c_col1_values is iterated over, and for each of those elements the program determines whether that element appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is True, or more precisely whether every item is truthy (non-zero, non-empty). So:

In [10]: all([True, True, True])
Out[10]: True

In [11]: all([True, True, True, False])
Out[11]: False

In [12]: all([True, True, True, 1])
Out[12]: True

In [13]: all([True, True, True, 0])
Out[13]: False

In [14]: all([True, 1, True, 0])
Out[14]: False 

So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there; if it is, the entry is True, and if it is not, it is False. So in this case this resolves to:

all([True, False])

which is of course False. Now, when the element is 'Zn', the result of this operation is

all([True, True])

which resolves to True. Therefore 'Zn' is appended to the common_vals list.

Once the process is complete the values inside common_vals are:

['O', 'F', 'Zn']

The return statement simply slices the data chunk according to whether the values of c_col1 are in the list common_vals, as described above.
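
To make that concrete for the s_id == 105 chunk (the names below are only for illustration):

chunk_105 = df[df.s_id == 105]    # the piece that apply hands to c_id_overlap
common_vals = ['O', 'F', 'Zn']    # the values collected for this group
sliced = chunk_105[chunk_105.c_col1.isin(common_vals)]
# keeps rows 16, 17, 18, 20, 21 and 23 -- the s_id == 105 block of the output shown earlier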

This is then repeated for each of the remaining groups and the data are glued back together.

Hope this helps

