如何处理 ValueError:索引包含使用 df.pivot 或 pd.pivot_table 的重复条目? [英] How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?
问题描述
我有一个表格,显示了不同专家(ID
)完成一系列四项任务所花费的累计小时数(dataframe values
)['Task1, 'Tas2', 'Task3, 'Tas4']
像这样:
输入:
ID Task1 Task2 Task3 Task40 10 1 3 4 61 11 1 3 4 52 12 1 3 4 6
现在我想重塑该数据框,以便我可以轻松找出每个专家在 1 小时、2 小时等后正在处理的任务.因此所需的输出如下所示:
所需的输出:
值 1 3 4 5 6ID10 任务1 任务2 任务3 任务3 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4
使用这个特定的数据框,我设法使用 pd.melt()
、pd.pivot()
和 pd.fillna 生成所需的输出()
像这样(完整的代码片段和示例数据):
我尝试过的:
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)
问题是这种方法不是很健壮,因为它很容易被一个会呈现(我认为)重复列名的数据集崩溃.这是一个示例,只需将 ID=0
的 Task3
从 5 更改为 4
:
代码 1
将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 4, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df
代码 1 - 错误:
<块引用>ValueError: 索引包含重复条目,无法重塑
根据文档, pd.pivot_table()是:
<块引用>可以处理一对索引/列的重复值的数据透视表的泛化.
所以我希望 pd.pivot_table()
更适合这种情况.唉,这会触发:
DataError: 没有要聚合的数字类型
有谁知道是否有可能获得处理这些错误的可靠方法?我可能只是以错误的方式使用 pd.pivot_table()
吗?我还尝试包含 aggfunc=None
.
我在这里不知所措,所以任何建议都会很棒!尽管我希望采用 df.pivot
或 pd.pivot_table
和/或尽可能最短的方法.
完整的工作代码示例:
将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 5, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df
df.pivot
和 pd.pivot_table
都失败的完整示例:
将pandas导入为pddf1 = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},'任务 1':{0: 1, 1: 1, 2: 1},'Task2': {0: 4, 1: 3, 2: 3},'Task3': {0: 4, 1: 4, 2: 4},'Task4': {0: 6, 1: 5, 2: 6}})df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])# df = df.pivot(index='ID', columns = 'value', values = 'variable')df = df.pivot_table(index='ID', columns = 'value', values = 'variable')df = df.fillna(方法 = '填充', 轴 = 1)df
您也可以使用 pd.crosstab:
dfm = df.melt('ID', value_name='val')df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)打印(df_out)
输出:
val 1 3 4 5 6ID10 任务1 任务1 任务2 任务2 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4
或者将 aggfunc 更改为last":
dfm = df.melt('ID', value_name='val')df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)df_out
输出:
val 1 3 4 5 6ID10 任务1 任务1 任务3 任务3 任务411 任务1 任务2 任务3 任务4 任务412 任务1 任务2 任务3 任务3 任务4
I've got a table showing the accumulated number of hours (dataframe values
) different specialists (ID
) have taken to complete a sequence of four tasks ['Task1, 'Tas2', 'Task3, 'Tas4']
like this:
Input:
ID Task1 Task2 Task3 Task4
0 10 1 3 4 6
1 11 1 3 4 5
2 12 1 3 4 6
Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:
Desired output:
value 1 3 4 5 6
ID
10 Task1 Task2 Task3 Task3 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
With this particular dataframe, I've managed to produce the desired output using pd.melt()
, pd.pivot()
and pd.fillna()
like this (complete snippet with sample data further down):
What I have tried:
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
The problem is that this approach is not very robust, in that it can easily collapse with a dataset that would render (I think) duplicate column names. Here's an example where that happens by just changing Task3
for ID=0
from 5 to 4
:
Code 1
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Code 1 - Error:
ValueError: Index contains duplicate entries, cannot reshape
And according to the docs, pd.pivot_table() is a:
generalization of pivot that can handle duplicate values for one index/column pair.
So I was hoping that pd.pivot_table()
would be better suited for this case. Alas, this triggers:
DataError: No numeric types to aggregate
Does anyone know if it's at all possible to obtain a robust way of handling these errors? Am I perhaps only using pd.pivot_table()
the wrong way? I've also tried to include aggfunc=None
.
I'm at a loss here, so any suggestions would be great! Although I'm hoping for an approach with df.pivot
or pd.pivot_table
and / or the shortest approach possible.
Complete working code example:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 5, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Complete example where both df.pivot
and pd.pivot_table
fail:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
You can do this also using pd.crosstab:
dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)
Output:
val 1 3 4 5 6
ID
10 Task1 Task1 Task2 Task2 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
Or changing the aggfunc to 'last':
dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out
Output:
val 1 3 4 5 6
ID
10 Task1 Task1 Task3 Task3 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
这篇关于如何处理 ValueError:索引包含使用 df.pivot 或 pd.pivot_table 的重复条目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!