在pandas df中重新分配列值 [英] Re-assign column values in a pandas df

查看:111
本文介绍了在pandas df中重新分配列值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题与花名册或人员配置有关.我正在尝试将各种工作分配给个人(员工).使用下面的df

This question is related to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,

`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time

因此,[Area][Place]一起将构成不同工作的unique值.这些值将分配给个人,其总体目标是使用尽可能少的个人.对于任何一个人来说,最唯一的值assigned是3.[On]显示[Place][Area]的当前unique值的数量.因此,这为我需要多少个人提供了具体的指导.例如

So [Area] and [Place] together will make up unique values that are different jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible. The most unique values assigned to any one individual is 3. [On] displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,

1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc

问题: [Area][Place]中的unique值的数量大于3的情况给我造成了麻烦.我无法执行groupby,其中我assign第一个3 unique valuesindividual 1以及接下来的3个unique值至individual 2等.我想将[Area]通过[Area].因此,将[Area]中与assign相同的值用于一个个体(最多3个).然后,如果有剩余个值(<3),则应将它们合并为一组(如果可能),为3.

Question: Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc. I want to group unique values in [Area] and [Place] by [Area]. So look to assign same values in [Area] to an individual (up to 3). Then, if there are leftover values (<3), they should be combined to make a group of 3, where possible.

我设想此工作的方式是:hour展望未来.对于每个新的row值,script应该查看多少个[On]值(这表明需要多少个人).如果unique值> 3,则它们应为assigned乘以grouping,与[Area]中的值相同.如果有剩余个值,则应将它们组合起来以组成3个一组.

The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.

将其逐步进行:

1)使用[On] Column来确定需要多少人,方法是展望未来寻找hour

1) Use the [On] Column to determine how many individuals are required by looking into the future for an hour

2)如果出现超过3个unique值,请在[Area]中首先分配相同的值.

2) Where there are more than 3 unique values occurring assign the identical values in [Area] first.

3)如果有任何剩余值,则无论如何都希望合并.

3) If there are any leftover values then look to combine anyway possible.

对于下面的df,对于[Place][Area]以及hour,有9个unique值出现.因此,我们应该有3个人assigned.当unique值> 3时,应由[Area]分配,并查看是否出现相同的值. 剩余值应与其他少于3个unique值的个体组合.

For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.

import pandas as pd
import numpy as np

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],                 
    'Area' : ['A','B','C','D','E','D','E','F','G'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],   
     })

df = pd.DataFrame(data=d)

这是我的尝试:

def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'

    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] + 
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1], 
                leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

df1 = (reduce_df(reduce_df(df)))

这是输出:

       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 4
4   8:35:00  House 5    E  5  Person 5
5   8:40:00  House 1    D  6  Person 4
6   8:42:00  House 2    E  7  Person 5
7   8:45:00  House 3    F  8  Person 5
8   8:50:00  House 2    G  9  Person 7

这是我的预期输出:

       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 5    E  5  Person 3
5   8:40:00  House 6    D  6  Person 2
6   8:42:00  House 2    E  7  Person 3
7   8:45:00  House 3    F  8  Person 2
8   8:50:00  House 2    G  9  Person 3

关于如何获取此输出的说明:

Description on how I want to get this output:

Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3

示例2:

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],                 
    'Area' : ['X','X','X','X','X','X','X','X','X'],     
    'On' : ['1','2','3','3','3','3','3','3','3'], 
    'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],   
    })

    df = pd.DataFrame(data=d)

我遇到错误:

 IndexError: index 1 is out of bounds for axis 1 with size 1

在这一行:

df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]

但是,如果我将Person更改为1,2,3重复,它将返回以下内容:

However, if I change the Person to 1,2,3 repeating, it returns the following:

'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'], 

      Time    Place Area On    Person
0  8:03:00  House 1    X  1  Person 1
1  8:17:00  House 2    X  2  Person 1
2  8:20:00  House 3    X  3  Person 1
3  8:28:00  House 1    X  3  Person 2
4  8:35:00  House 2    X  3  Person 2
5  8:40:00  House 3    X  3  Person 2
6  8:42:00  House 1    X  3  Person 3
7  8:45:00  House 2    X  3  Person 3
8  8:50:00  House 3    X  3  Person 3

预期输出:

      Time    Place Area On    Person
0  8:03:00  House 1    X  1  Person 1
1  8:17:00  House 2    X  2  Person 1
2  8:20:00  House 3    X  3  Person 1
3  8:28:00  House 1    X  3  Person 1
4  8:35:00  House 2    X  3  Person 1
5  8:40:00  House 3    X  3  Person 1
6  8:42:00  House 1    X  3  Person 1
7  8:45:00  House 2    X  3  Person 1
8  8:50:00  House 3    X  3  Person 1

示例2的主要内容是:

1) There are <3 unique values on so assign to individual 1

推荐答案

更新

此答案的在线版本可以在线尝试.

以下是allocatePeople函数形式的答案.它基于预先计算一个小时内重复出现的所有索引的所有索引:

Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:

from collections import Counter
import numpy as np
import pandas as pd

def getAssignedPeople(df, areasPerPerson):
    areas = df['Area'].values
    places = df['Place'].values
    times = pd.to_datetime(df['Time']).values
    maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
    assignmentCount = Counter()
    assignedPeople = []
    assignedPlaces = {}
    heldPeople = {}
    heldAreas = {}
    holdAvailable = True
    person = 0

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    for area,place,hold in zip(areas, places, holds):
        if (area, place) in assignedPlaces:
            # this unique (area, place) has already been assigned to someone
            assignedPeople.append(assignedPlaces[(area, place)])
            continue

        if assignmentCount[person] >= areasPerPerson:
            # the current person is already assigned to enough areas, move on to the next
            a = heldPeople.pop(person, None)
            heldAreas.pop(a, None)
            person += 1

        if area in heldAreas:
            # assign to the person held in this area
            p = heldAreas.pop(area)
            heldPeople.pop(p)
        else:
            # get the first non-held person. If we need to hold in this area, 
            # also make sure the person has at least 2 free assignment slots,
            # though if it's the last person assign to them anyway 
            p = person
            while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
                p += 1

        assignmentCount.update([p])
        assignedPlaces[(area, place)] = p
        assignedPeople.append(p)

        if hold:
            if p==maxPerson:
                # mark that there are no more people available to perform holds
                holdAvailable = False

            # this area recurrs in an hour, mark that the person should be held here
            heldPeople[p] = area
            heldAreas[area] = p

    return assignedPeople

def allocatePeople(df, areasPerPerson=3):
    assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
    df = df.copy()
    df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
    return df

请注意在allocatePeople中使用df['Person'].unique().处理输入中重复人员的情况.假定输入中人员的顺序是应该分配这些人员的所需顺序.

Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.

我针对OP的示例输入(example1example2)以及我想到的一些边缘情况(认为)符合OP所需的算法对allocatePeople进行了测试:

I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:

ds = dict(
example1 = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],                 
    'Area' : ['A','B','C','D','E','D','E','F','G'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],   
    }),
example2 = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],                 
    'Area' : ['X','X','X','X','X','X','X','X','X'],     
    'On' : ['1','2','3','3','3','3','3','3','3'], 
    'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],   
    }),

long_repeats = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],                 
    'Area' : ['A','A','A','A','B','C','C','C','B'],  
    'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],   
    'On' : ['1','2','3','4','5','6','7','8','9'],                      
    }),
many_repeats = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],                 
    'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],   
    }),
large_gap = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],                 
    'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],   
    }),
different_times = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],                 
    'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],                 
    'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],     
    'On' : ['1','2','3','4','5','6','7','8','9'], 
    'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],   
    })
)

expectedPeoples = dict(
    example1 = [1,1,1,2,3,2,3,2,3],
    example2 = [1,1,1,1,1,1,1,1,1],
    long_repeats = [1,1,1,2,2,3,3,3,2],
    many_repeats = [1,1,1,2,2,3,3,2,3],
    large_gap = [1,1,1,2,3,3,2,2,3],
    different_times = [1,1,1,2,2,2,3,3,3],
)

for name,d in ds.items():
    df = pd.DataFrame(d)
    expected = ['Person %d' % i for i in expectedPeoples[name]]
    ap = allocatePeople(df)

    print(name, ap, sep='\n', end='\n\n')
    np.testing.assert_array_equal(ap['Person'], expected)

assert_array_equal语句通过,并且输出与OP的预期输出匹配:

The assert_array_equal statements pass, and the output matches OP's expected output:

example1
       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 5    E  5  Person 3
5  08:40:00  House 1    D  6  Person 2
6  08:42:00  House 2    E  7  Person 3
7  08:45:00  House 3    F  8  Person 2
8  08:50:00  House 2    G  9  Person 3

example2
      Time    Place Area On    Person
0  8:03:00  House 1    X  1  Person 1
1  8:17:00  House 2    X  2  Person 1
2  8:20:00  House 3    X  3  Person 1
3  8:28:00  House 1    X  3  Person 1
4  8:35:00  House 2    X  3  Person 1
5  8:40:00  House 3    X  3  Person 1
6  8:42:00  House 1    X  3  Person 1
7  8:45:00  House 2    X  3  Person 1
8  8:50:00  House 3    X  3  Person 1

我的测试用例的输出也符合我的期望:

The output for my test cases matches my expectations as well:

long_repeats
      Time    Place Area    Person On
0  8:03:00  House 1    A  Person 1  1
1  8:17:00  House 2    A  Person 1  2
2  8:20:00  House 3    A  Person 1  3
3  8:25:00  House 4    A  Person 2  4
4  8:30:00  House 1    B  Person 2  5
5  8:31:00  House 1    C  Person 3  6
6  8:35:00  House 2    C  Person 3  7
7  8:45:00  House 3    C  Person 3  8
8  8:50:00  House 2    B  Person 2  9

many_repeats
       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 1    D  5  Person 2
5  08:40:00  House 1    E  6  Person 3
6  08:42:00  House 2    E  7  Person 3
7  08:45:00  House 1    F  8  Person 2
8  08:50:00  House 2    F  9  Person 3

large_gap
       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 1    E  5  Person 3
5  08:40:00  House 1    F  6  Person 3
6  08:42:00  House 2    D  7  Person 2
7  08:45:00  House 1    D  8  Person 2
8  08:50:00  House 3    D  9  Person 3

different_times
       Time    Place Area On    Person
0   8:03:00  House 1    A  1  Person 1
1   8:17:00  House 2    B  2  Person 1
2   8:20:00  House 3    C  3  Person 1
3   8:28:00  House 4    D  4  Person 2
4   8:35:00  House 1    D  5  Person 2
5  08:40:00  House 1    E  6  Person 2
6  09:42:00  House 2    E  7  Person 3
7  09:45:00  House 1    F  8  Person 3
8  09:50:00  House 1    G  9  Person 3

让我知道它是否可以完成您想要的一切,或者是否仍需要进行一些调整.我认为每个人都渴望看到您实现您的愿景.

Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.

这篇关于在pandas df中重新分配列值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆