如何处理 ValueError:索引包含使用 df.pivot 或 pd.pivot_table 的重复条目? [英] How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

查看:74
本文介绍了如何处理 ValueError:索引包含使用 df.pivot 或 pd.pivot_table 的重复条目?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个表格,显示了不同专家(ID)完成一系列四项任务所花费的累计小时数(dataframe values)['Task1, 'Tas2', 'Task3, 'Tas4'] 像这样:

输入:

 ID Task1 Task2 Task3 Task40 10 1 3 4 61 11 1 3 4 52 12 1 3 4 6

现在我想重塑该数据框,以便我可以轻松找出每个专家在 1 小时、2 小时等后正在处理的任务.因此所需的输出如下所示:

所需的输出:

值 1 3 4 5 6ID10 任务1 任务2 任务3 任务3 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4

使用这个特定的数据框,我设法使用 pd.melt()pd.pivot()pd.fillna 生成所需的输出() 像这样(完整的代码片段和示例数据):

我尝试过的:

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)

问题是这种方法不是很健壮,因为它很容易被一个会呈现(我认为)重复列名的数据集崩溃.这是一个示例,只需将 ID=0Task35 更改为 4:

代码 1

将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 4, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df

代码 1 - 错误:

<块引用>

ValueError: 索引包含重复条目,无法重塑

根据文档, pd.pivot_table()是:

<块引用>

可以处理一对索引/列的重复值的数据透视表的泛化.

所以我希望 pd.pivot_table() 更适合这种情况.唉,这会触发:

<块引用>

DataError: 没有要聚合的数字类型

有谁知道是否有可能获得处理这些错误的可靠方法?我可能只是以错误的方式使用 pd.pivot_table() 吗?我还尝试包含 aggfunc=None.

我在这里不知所措,所以任何建议都会很棒!尽管我希望采用 df.pivotpd.pivot_table 和/或尽可能最短的方法.

完整的工作代码示例:

将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 5, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df

df.pivotpd.pivot_table 都失败的完整示例:

将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 4, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])# df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.pivot_table(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df

解决方案

您也可以使用 pd.crosstab:

dfm = df.melt('ID', value_name='val')df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)打印(df_out)

输出:

val 1 3 4 5 6ID10 任务1 任务1 任务2 任务2 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4

或者将 aggfunc 更改为last":

dfm = df.melt('ID', value_name='val')df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)df_out

输出:

val 1 3 4 5 6ID10 任务1 任务1 任务3 任务3 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4

I've got a table showing the accumulated number of hours (dataframe values) different specialists (ID) have taken to complete a sequence of four tasks ['Task1, 'Tas2', 'Task3, 'Tas4'] like this:

Input:

    ID  Task1   Task2   Task3   Task4
0   10      1       3       4       6
1   11      1       3       4       5
2   12      1       3       4       6

Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:

Desired output:

value   1       3       4       5       6
ID                  
10  Task1   Task2   Task3   Task3   Task4
11  Task1   Task2   Task3   Task4   Task4
12  Task1   Task2   Task3   Task3   Task4

With this particular dataframe, I've managed to produce the desired output using pd.melt(), pd.pivot() and pd.fillna() like this (complete snippet with sample data further down):

What I have tried:

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)

The problem is that this approach is not very robust, in that it can easily collapse with a dataset that would render (I think) duplicate column names. Here's an example where that happens by just changing Task3 for ID=0 from 5 to 4:

Code 1

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Code 1 - Error:

ValueError: Index contains duplicate entries, cannot reshape

And according to the docs, pd.pivot_table() is a:

generalization of pivot that can handle duplicate values for one index/column pair.

So I was hoping that pd.pivot_table() would be better suited for this case. Alas, this triggers:

DataError: No numeric types to aggregate

Does anyone know if it's at all possible to obtain a robust way of handling these errors? Am I perhaps only using pd.pivot_table() the wrong way? I've also tried to include aggfunc=None.

I'm at a loss here, so any suggestions would be great! Although I'm hoping for an approach with df.pivot or pd.pivot_table and / or the shortest approach possible.

Complete working code example:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 5, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Complete example where both df.pivot and pd.pivot_table fail:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')

df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

解决方案

You can do this also using pd.crosstab:

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task2  Task2  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4

Or changing the aggfunc to 'last':

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task3  Task3  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4

这篇关于如何处理 ValueError:索引包含使用 df.pivot 或 pd.pivot_table 的重复条目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆