尝试使用多个空列(NaN)取消堆叠数据帧 [英] Trying to unstack dataframe with multiple empty columns (NaN)

查看:44
本文介绍了尝试使用多个空列(NaN)取消堆叠数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前有一个代码可以解决这个问题:

I currently have a code which turns this:

       A     B    C  D    E  F    G  H    I  J
0  1.1.1  amba   50  1  131  4   40  3  150  5
1  2.2.2  erto   50  7   40  8  150  8  131  2
2  3.3.3  gema  131  2  150  5   40  1   50  3

进入这个:

      ID  User  40  50  131  150
0  1.1.1  amba   3   1    4    5
1  2.2.2  erto   8   7    2    8
2  3.3.3  gema   1   3    2    5

在这里你可以检查代码:

And here you can check the code:

import pandas as pd
import io

df1 = pd.read_csv(io.StringIO("""  A       B       C      D      E      F      G      H      I       J     
1.1.1   amba     50      1     131     4      40       3       150       5   
2.2.2   erto     50      7     40      8      150      8       131       2
3.3.3   gema     131     2    150      5      40       1        50       3"""), sep="\s+")



print(df1)


df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
           df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
           df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
           df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
          ])      
 .rename(columns={"A":"ID","B":"User"})
 .set_index(["ID","User","key"])
 .unstack(2)
 .reset_index()
)

# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2

如果 Key 列具有唯一值,程序可以正常运行,但如果存在重复值,则程序会失败.我遇到的问题是,我的实际数据帧的行有 30 个,其他的有 60 个,其他的有 63 个等等.所以程序将空值检测为重复,并且程序失败.

The program works correctly if Key colums have unique values but it fails if there are duplicate values. The issue I have is that my actual dataframe has rows with 30 clumns, other with 60, other with 63, etc. So the program is detecting empty values as duplicate and the program fails.

请检查此示例:

       A     B    C  D    E  F      G    H      I    J
0  1.1.1  amba   50  1  131  4    NaN  NaN    NaN  NaN
1  2.2.2  erto   50  7   40  8  150.0  8.0  131.0  2.0
2  3.3.3  gema  131  2  150  5   40.0  1.0   50.0  3.0

我想得到这样的东西:

      ID  User  40  50  131  150
0  1.1.1  amba       1    4    
1  2.2.2  erto   8   7    2    8
2  3.3.3  gema   1   3    2    5

如果我尝试将其拆开,则会收到错误索引包含重复条目,无法重塑".我一直在阅读这方面的内容,df.drop_duplicates、pivot_tables、tc 可以在这种情况下提供帮助,但我不能用我当前的代码来做任何事情.知道如何解决这个问题吗?谢谢.

If I try to unstack this, i get the error "Index contains duplicate entries, cannot reshape". I have been reading about this and df.drop_duplicates, pivot_tables, tc could help in this situation but I cannot just make work anything of this with my current code. Any idea about how o fix this? Thanks.

推荐答案

想法是将前 2 列转换为 MultiIndex,然后使用 concat 按选定的对和 DataFrame.iloc,由 DataFrame.stack 并通过 DataFrame.reset_index:

Idea is convert first 2 columns to MultiIndex, then use concat by selected pair and unpair columns by DataFrame.iloc, reshaped by DataFrame.stack and removed third unnecessary level of MultiIndex by DataFrame.reset_index:

df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
               df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)], 
               axis=1, keys=('key','val'))

最后通过 DataFrame.set_index 并通过 Series.unstack,通过MultiIndex转换为列>reset_index,通过 DataFrame.rename_axis:

Last add key column to MultiIndex by DataFrame.set_index and reshape by Series.unstack, convert MultiIndex to columns by reset_index, rename columns names and last remove columns levels name by DataFrame.rename_axis:

df = (df.set_index('key', append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
      ID  User  40  50  131  150
0  1.1.1  amba   3   1    4    5
1  2.2.2  erto   8   7    2    8
2  3.3.3  gema   1   3    2    5

对于第二个示例,它也运行良好,因为 stack 删除了缺失的行,还添加了 rename 以将列名转换为 int 如果可能:

Also it working well for second example, because missing rows are removed by stack, also added rename for convert columns names to int if possible:

df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
               df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)], 
               axis=1, keys=('key','val'))


print (df)
              key  val
A     B               
1.1.1 amba   50.0  1.0
      amba  131.0  4.0
2.2.2 erto   50.0  7.0
      erto   40.0  8.0
      erto  150.0  8.0
      erto  131.0  2.0
3.3.3 gema  131.0  2.0
      gema  150.0  5.0
      gema   40.0  1.0
      gema   50.0  3.0


df = (df.set_index('key', append=True)['val']
        .unstack()
        .rename(columns=int)
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
      ID  User   40   50  131  150
0  1.1.1  amba  NaN  1.0  4.0  NaN
1  2.2.2  erto  8.0  7.0  2.0  8.0
2  3.3.3  gema  1.0  3.0  2.0  5.0

EDIT1 添加了带有计数器的辅助列以避免重复:

EDIT1 Added helper column with counter for avoid duplicates:

print (df)
       A     B    C  D    E  F    G  H    I  J
0  1.1.1  amba   50  1   50  4   40  3  150  5 <- E=50
1  2.2.2  erto   50  7   40  8  150  8  131  2
2  3.3.3  gema  131  2  150  5   40  1   50  3

df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
               df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)], 
               axis=1, keys=('key','val'))

df['g'] = df.groupby(['A','B','key']).cumcount()
print (df)
            key  val  g
A     B                
1.1.1 amba   50    1  0
      amba   50    4  1
      amba   40    3  0
      amba  150    5  0
2.2.2 erto   50    7  0
      erto   40    8  0
      erto  150    8  0
      erto  131    2  0
3.3.3 gema  131    2  0
      gema  150    5  0
      gema   40    1  0
      gema   50    3  0


df = (df.set_index(['g','key'], append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
      ID  User  g   40   50  131  150
0  1.1.1  amba  0  3.0  1.0  NaN  5.0
1  1.1.1  amba  1  NaN  4.0  NaN  NaN
2  2.2.2  erto  0  8.0  7.0  2.0  8.0
3  3.3.3  gema  0  1.0  3.0  2.0  5.0

这篇关于尝试使用多个空列(NaN)取消堆叠数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆