pandas 基于多个条件将行链接到行 [英] Pandas Link Rows to Rows Based on Multiple Criteria

查看:88
本文介绍了 pandas 基于多个条件将行链接到行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我看来,这是熊猫的终极挑战,尽管它可能对其中的某些人来说是最基本的……

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...

我正在尝试将特定职位与与其对应的调查项目相关联.例如,站点A的总裁将归因于一项调查项目的结果,站点A的受访者为此提供了反馈(即您在多大程度上同意以下说法?":我认为站点A的质量足够整体"".

I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed to results from a survey item for which respondents from site A provide feedback (i.e "To what degree do you agree with the following statement?: "I think the quality of site A is sufficient overall"").

每个站点都有5个站点(0到4).将每个职位分配给一个或多个站点的一个或多个站点.

Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).

例如,某站点的总裁在该站点内的所有站点工作,而承包商可能只在几个站点(可能在2个不同站点)工作.

For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.

收集有关每个站点内每个站点质量的调查数据.

Survey data was collected on the quality of each station within each site.

一些调查项目与一个或多个站点中的某些站点有关.

Some survey items pertain to certain stations within one or more sites.

例如,位置"表如下所示:

For example, the "Positions" table looks like this:

import pandas as pd
import numpy as np

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']
                   })

pos[['Position','Site(s)','Station(s)','Item(s)']]

    Position    Site(s)     Station(s)   Item(s)
0   Contractor  A,B          ,1,2,,         1
1   President   A           0,1,2,3,4       1,2

调查数据表如下:

sd = pd.DataFrame({'Site(s)':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]

    Site    Station(s)  Item 1  Item 2
0      A       ,1,2,,          1    1
1      B       ,1,2,,          1    0
2      B       ,,,,            0    0
3      C       ,1,2,,          0    1
4      A       0,1,2,,         1    0
5      A       ,,2,,           NaN  1

2个旁注:

  1. 由于不重要的原因,商品数据已被编码为1和0.

  1. The item data has been coded to 1 and 0 for unimportant reasons.

逗号分隔的响应实际上是从列(每个站点和每个项目一列)中压缩而成的.我仅提及,因为如果最好不压缩它们,那么可以(或不可以)做到这一点.

The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).

这就是我所需要的:

这个(我认为):

    Contractor  President Site(s)  Station(s)   Item 1  Item 2
0      1           1        A       ,1,2,,        1       1
1      1           0        B       ,1,2,,        1       0
2      0           0        B       ,,,,          0       0
3      0           0        C       ,1,2,,        0       1
4      0           1        A       0,1,2,,       1       0
5      1           1        A       ,,2,,         NaN     1

逻辑:

承包商在站点A和B上工作,并且仅应与在这两个站点中的任何一个上工作的应答者相关联. 在这些受访者中,他应仅与在1号或2号车站工作的人相关联,而 其他任何号码(例如,第0站).

The contractor works at site A and B and should only be associated with respondents who work either of those sites. Within those respondents, he should only associated with those who work at stations 1 or 2 but none that also work at any others (i.e. station 0).

因此,承包商在df2中的关注行为索引0、1和5. 总统感兴趣的行来自索引0、4和5.

Therefore, the contractor's rows of interest in df2 are indices 0, 1, and 5. The president's rows of interest are from indices 0, 4, and 5.

...,最终,这是

    Position    Overall%
0   Contractor  100
1   President   80

逻辑:

由于总统关心第1项和第2项,因此需要考虑5个数字:第1项的(1和1)和第2项的(1、0和1). 所有项目的总数为4,所有项目的计数为5(再次,不计算"NaN"),得出80%.

Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.

由于承包商仅关注第1项,因此需要考虑2个数字:1和1-不应计算"NaN"-(分别来自感兴趣的行).因此,总和是计数中的2,即2,即100%

Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%

提前谢谢!

更新

我知道这可行(

I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):

for i in pos['Position']:
    sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]

...但是它引发了此错误:

...but it threw this error:

KeyError: 'the label [Site(s)] is not in the [index]'

...所以我仍在努力.

...so I'm still wrestling with it.

推荐答案

如果我正确理解,您想为pos中的每个职位添加一列到sd中. (这是第一个任务.)

If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)

因此,对于pos中的每个行索引i(我们遍历pos中的行),我们可以创建一个唯一的布尔列:

So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:

# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])

我希望这里的逻辑清晰并与您的目标保持一致.

I hope the logic here is clear and consistent with your goal.

表达式X IS_CONTAINED_IN Y可以用许多不同的方式实现. (我可以想到X和Y被设置,然后是X.subset(Y).或者就位掩码X,Y和bitwise_xor而言.)

The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets and then it's X.subset(Y). Or in terms of bitmasks X, Y and bitwise_xor.)

列名position_name_i可以只是一个整数i. (或者,如果此列由唯一值组成,则更有意义的名称是pos.loc[i,'Position'].)

The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)

完成此操作后,我们可以完成其他任务.现在,df[df[position_name_i]]将仅返回position_name_i为True的df行.

If this is done, we can do the other task. Now, df[df[position_name_i]] will return only the rows of df for which position_name_i is True.

我们遍历所有位置(即pos中的行).对于每个职位:

We iterate over all positions (i.e. rows in pos). For each position:

# number of non-nan entries in 'Item 1' and 'Item 2' relevant for the position:
total = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries: 
partial = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].sum().sum()

给定位置的最终Overall%100*partial/total.

这篇关于 pandas 基于多个条件将行链接到行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆