尝试使用多个空列(NaN)取消堆叠数据帧 [英] Trying to unstack dataframe with multiple empty columns (NaN)
问题描述
我目前有一个代码可以解决这个问题:
I currently have a code which turns this:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 40 3 150 5
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
进入这个:
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
在这里你可以检查代码:
And here you can check the code:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 50 1 131 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep="\s+")
print(df1)
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
])
.rename(columns={"A":"ID","B":"User"})
.set_index(["ID","User","key"])
.unstack(2)
.reset_index()
)
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
如果 Key 列具有唯一值,程序可以正常运行,但如果存在重复值,则程序会失败.我遇到的问题是,我的实际数据帧的行有 30 个,其他的有 60 个,其他的有 63 个等等.所以程序将空值检测为重复,并且程序失败.
The program works correctly if Key colums have unique values but it fails if there are duplicate values. The issue I have is that my actual dataframe has rows with 30 clumns, other with 60, other with 63, etc. So the program is detecting empty values as duplicate and the program fails.
请检查此示例:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 NaN NaN NaN NaN
1 2.2.2 erto 50 7 40 8 150.0 8.0 131.0 2.0
2 3.3.3 gema 131 2 150 5 40.0 1.0 50.0 3.0
我想得到这样的东西:
ID User 40 50 131 150
0 1.1.1 amba 1 4
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
如果我尝试将其拆开,则会收到错误索引包含重复条目,无法重塑".我一直在阅读这方面的内容,df.drop_duplicates、pivot_tables、tc 可以在这种情况下提供帮助,但我不能用我当前的代码来做任何事情.知道如何解决这个问题吗?谢谢.
If I try to unstack this, i get the error "Index contains duplicate entries, cannot reshape". I have been reading about this and df.drop_duplicates, pivot_tables, tc could help in this situation but I cannot just make work anything of this with my current code. Any idea about how o fix this? Thanks.
推荐答案
想法是将前 2 列转换为 MultiIndex
,然后使用 concat
按选定的对和 DataFrame.iloc
,由 DataFrame.stack
并通过 DataFrame.reset_index
:
Idea is convert first 2 columns to MultiIndex
, then use concat
by selected pair and unpair columns by DataFrame.iloc
, reshaped by DataFrame.stack
and removed third unnecessary level of MultiIndex
by DataFrame.reset_index
:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
最后通过 DataFrame.set_index
并通过 Series.unstack
,通过将
,通过 MultiIndex
转换为列>reset_indexDataFrame.rename_axis
:
Last add key
column to MultiIndex
by DataFrame.set_index
and reshape by Series.unstack
, convert MultiIndex
to columns by reset_index
, rename columns names and last remove columns levels name by DataFrame.rename_axis
:
df = (df.set_index('key', append=True)['val']
.unstack()
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
对于第二个示例,它也运行良好,因为 stack
删除了缺失的行,还添加了 rename
以将列名转换为 int
如果可能:
Also it working well for second example, because missing rows are removed by stack
, also added rename
for convert columns names to int
if possible:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
print (df)
key val
A B
1.1.1 amba 50.0 1.0
amba 131.0 4.0
2.2.2 erto 50.0 7.0
erto 40.0 8.0
erto 150.0 8.0
erto 131.0 2.0
3.3.3 gema 131.0 2.0
gema 150.0 5.0
gema 40.0 1.0
gema 50.0 3.0
df = (df.set_index('key', append=True)['val']
.unstack()
.rename(columns=int)
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba NaN 1.0 4.0 NaN
1 2.2.2 erto 8.0 7.0 2.0 8.0
2 3.3.3 gema 1.0 3.0 2.0 5.0
EDIT1 添加了带有计数器的辅助列以避免重复:
EDIT1 Added helper column with counter for avoid duplicates:
print (df)
A B C D E F G H I J
0 1.1.1 amba 50 1 50 4 40 3 150 5 <- E=50
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
df['g'] = df.groupby(['A','B','key']).cumcount()
print (df)
key val g
A B
1.1.1 amba 50 1 0
amba 50 4 1
amba 40 3 0
amba 150 5 0
2.2.2 erto 50 7 0
erto 40 8 0
erto 150 8 0
erto 131 2 0
3.3.3 gema 131 2 0
gema 150 5 0
gema 40 1 0
gema 50 3 0
df = (df.set_index(['g','key'], append=True)['val']
.unstack()
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User g 40 50 131 150
0 1.1.1 amba 0 3.0 1.0 NaN 5.0
1 1.1.1 amba 1 NaN 4.0 NaN NaN
2 2.2.2 erto 0 8.0 7.0 2.0 8.0
3 3.3.3 gema 0 1.0 3.0 2.0 5.0
这篇关于尝试使用多个空列(NaN)取消堆叠数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!