How to unnest (explode) a column in a pandas DataFrame

Question

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

What should I do to achieve this?


Related question

pandas: When cell contents are lists, create a row for each element in the list

That is a good question and answer, but it only handles a single list column. (In my answer, the custom function works for multiple columns. Also, the accepted answer there uses apply, which is the most time-consuming option and is not recommended; for more information see "When should I ever want to use pandas apply() in my code?")

Solution

I know that object-dtype columns make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the column.

I am using pandas and Python functions for this type of question. If you are worried about the speed of these solutions, check user3483203's answer, since it uses numpy, and most of the time numpy is faster. I recommend Cython or numba if speed matters.


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')

   A  B
0  1  1
0  1  2
1  2  1
1  2  2
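As a minimal, self-contained sketch of Method 0: the sample frame is re-created from the question, and the final reset_index call is an addition made here, only needed if you want a fresh 0..n-1 index similar to the expected output.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# explode repeats the original index label for every list element
exploded = df.explode('B')

# optional: replace the repeated labels with a fresh RangeIndex
result = exploded.reset_index(drop=True)
print(result)
```

The exploded column keeps object dtype, so a cast such as result['B'].astype(int) may be wanted afterwards.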

Now consider a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN needs to be replaced with a list first:

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with []
df.explode('B')

   A    B
0  1    1
0  1    2
1  2    1
1  2    2
2  3  NaN
3  4  NaN
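If fillna with a dictionary of lists does not behave as expected in your pandas version, a list comprehension is one possible alternative. This is only a sketch: treating every non-list cell as missing is an assumption made here, not something stated in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [[1, 2], [1, 2], [], np.nan]})

# Assumption: any cell that is not a list counts as missing and becomes []
df['B'] = [x if isinstance(x, list) else [] for x in df['B']]

lengths = df['B'].str.len()   # every cell is now a list, so lengths are defined
out = df.explode('B')         # empty lists become a single NaN row
print(out)
```

With every cell guaranteed to be a list, the length-based methods further down (repeat, np.concatenate) can also be applied without special-casing NaN.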


Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2


Method 2
Use repeat with the DataFrame constructor to re-create your dataframe (good performance, but awkward with multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
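The same idea can be sketched with lists of different lengths, which shows why repeating by the list length keeps the two columns aligned. The sample data and the names lens and flat are illustrative only, and the result is stored under a new name so that the original df stays available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2, 3], [4]]})

lens = df['B'].str.len()   # number of elements in each list: 3 and 1
flat = pd.DataFrame({
    'A': df['A'].repeat(lens),              # each A value once per list element
    'B': np.concatenate(df['B'].values),    # all list elements in one flat array
})
print(flat)
```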

Method 2.1
For example, suppose that besides A we have A.1 ... A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.

Solution: unnest the single column, then join or merge back on the index

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop(columns='B'),how='left')
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop(columns='B'),how='left').reindex(columns=df.columns)
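A runnable sketch of the join approach with one extra column: the column name A1 and the sample values are hypothetical, chosen only to show that the remaining columns come along without being rebuilt by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'A1': ['x', 'y'], 'B': [[1, 2], [3]]})

# Unnest only B, keeping the original row label for every element
s = pd.DataFrame({'B': np.concatenate(df['B'].values)},
                 index=df.index.repeat(df['B'].str.len()))

# Join the remaining columns back on the index, then restore the column order
out = s.join(df.drop(columns='B'), how='left').reindex(columns=df.columns)
print(out)
```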


Method 3
Re-create the rows with a list comprehension

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If there are more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]
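The merged result above still carries the helper columns 0 and 1 and the original list column. One possible cleanup is sketched below; the helper column name row and the extra column C are hypothetical and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]], 'C': ['x', 'y']})

# One (row label, element) pair per list element, with explicit column names
pairs = pd.DataFrame(
    [(i, z) for i, y in zip(df.index, df['B']) for z in y],
    columns=['row', 'B'],
)

# Merge the other columns back via the row label, then drop the helper column
out = (pairs.merge(df.drop(columns='B'), left_on='row', right_index=True)
            .drop(columns='row')
            .reindex(columns=df.columns))
print(out)
```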


Method 4
Using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


Method 5
When the lists only contain unique values (no value may repeat across rows, because the values become dictionary keys):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2
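To see why the uniqueness condition matters, here is a small sketch that applies the same ChainMap construction to the frame from the question, where both rows hold the list [1, 2]. The repeated values collapse into two dictionary keys, so two of the four expected rows are lost.

```python
from collections import ChainMap

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# Each row becomes a dict {list element: A value}; ChainMap layers them,
# and on a key clash the first mapping wins
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))

print(len(d))   # only 2 entries instead of the 4 rows we want
```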


Method 6
Using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
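One thing to keep in mind with np.dstack is that the stacked array has a single dtype. The sketch below uses a float column A (an assumption made for illustration) to show the integer list values being promoted to float, and builds the frame column by column as one way to keep each dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.5, 2.5], 'B': [[1, 2], [3, 4]]})

lens = df['B'].str.len().to_numpy()
a = np.repeat(df['A'].to_numpy(), lens)   # A repeated once per list element
b = np.concatenate(df['B'].to_numpy())    # flattened list elements (integers)

stacked = np.dstack((a, b))[0]            # one 2-D array, one common dtype
per_column = pd.DataFrame({'A': a, 'B': b})   # each column keeps its own dtype

print(stacked.dtype, per_column.dtypes.to_dict())
```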


Method 7
Using the itertools functions cycle and chain: a pure Python solution, just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2
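A shorter pure Python variant can be sketched with itertools.repeat instead of cycle. This is a swapped-in alternative, not the answer's code, and it assumes exactly two columns with the list in the second one.

```python
from itertools import chain, repeat

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# Pair the scalar with every list element; zip stops when the list is exhausted
rows = chain.from_iterable(zip(repeat(a), bs) for a, bs in df.values.tolist())

out = pd.DataFrame(list(rows), columns=df.columns)
print(out)
```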


Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Custom function:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(columns=explode), how='left')

unnesting(df,['B','C'])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2
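In pandas 1.3 and later, DataFrame.explode also accepts a list of columns, which may replace the custom function for the vertical case. The sketch below assumes pandas >= 1.3 and that the lists in B and C have the same length in every row; explode raises an error otherwise.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2],
                   'B': [[1, 2], [3, 4]],
                   'C': [[1, 2], [3, 4]]})

# Explode both list columns in lockstep (requires pandas >= 1.3)
out = df.explode(['B', 'C'])
print(out)
```

Unlike the custom function, this keeps the original column order.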


Column-wise Unnesting

All of the methods above deal with vertical unnesting (explode). If you need to expand the lists horizontally, use the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]: 
   A       B       C  B_0  B_1
0  1  [1, 2]  [1, 2]    1    2
1  2  [3, 4]  [3, 4]    3    4
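When the lists have different lengths, the constructor pads the shorter rows with missing values. A small sketch with made-up data, just to show the padding:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3]]})

# One new column per list position; shorter lists are padded with NaN
wide = pd.DataFrame(df['B'].tolist(), index=df.index).add_prefix('B_')
out = df.join(wide)
print(out)
```

Because of the NaN padding, a padded column is stored as float rather than int.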

Updated function

def unnesting(df, explode, axis):
    if axis==1:
        # vertical: one row per list element
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx

        return df1.join(df.drop(columns=explode), how='left')
    else :
        # horizontal: one column per list element
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]: 
   B0  B1  C0  C1  A
0   1   2   1   2  1
1   3   4   3   4  2

Update 2021-02-17: using the built-in explode function

def unnesting(df, explode, axis):
    if axis==1:
        # vertical: explode each list column, then join the other columns back
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how='left')
    else :
        # horizontal: one column per list element
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(columns=explode), how='left')
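For reference, here is a self-contained run of this last version in both directions. The function body is repeated so the block runs on its own, with the drop call written using the columns keyword; the variable names vertical and wide are illustrative.

```python
import pandas as pd

def unnesting(df, explode, axis):
    if axis == 1:
        # vertical: one row per list element
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
    else:
        # horizontal: one column per list element
        df1 = pd.concat(
            [pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x)
             for x in explode], axis=1)
    return df1.join(df.drop(columns=explode), how='left')

df = pd.DataFrame({'A': [1, 2],
                   'B': [[1, 2], [3, 4]],
                   'C': [[1, 2], [3, 4]]})

vertical = unnesting(df, ['B', 'C'], axis=1)   # 4 rows, columns B, C, A
wide = unnesting(df, ['B', 'C'], axis=0)       # 2 rows, columns B0, B1, C0, C1, A
print(vertical)
print(wide)
```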
