在 Pandas DataFrame 中取消嵌套(分解)多个列表列的有效方法 [英] Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

查看:37
本文介绍了在 Pandas DataFrame 中取消嵌套(分解)多个列表列的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在将多个 JSON 对象读入一个 DataFrame.问题是某些列是列表.此外,数据非常大,因此我无法使用互联网上的可用解决方案.它们非常慢且内存效率低

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big and because of that I cannot use the available solutions on the internet. They are very slow and memory-inefficient

我的数据如下所示:

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

这是我的数据的形状:(441079, 12)

And this is the shape of my data: (441079, 12)

我想要的输出是:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

在被标记为重复后,我想强调一个事实,即在这个问题中,我正在寻找一种高效分解多列的方法.因此,批准的答案能够在非常大的数据集上有效地分解任意数量的列.另一个问题的答案没有做到这一点(这就是我在测试这些解决方案后问这个问题的原因).

After being marked as duplicate, I would like to stress on the fact that in this question I was looking for an efficient method of exploding multiple columns. Therefore the approved answer is able to explode an arbitrary number of columns on very large datasets efficiently. Something that the answers to the other question failed to do (and that was the reason I asked this question after testing those solutions).

推荐答案

pandas >= 0.25

假设所有列都有相同数量的列表,您可以调用 Series.explode 在每列上.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

这个想法是将所有必须NOT首先分解的列设置为索引,然后重置索引.

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.

它也更快.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', 1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

这篇关于在 Pandas DataFrame 中取消嵌套(分解)多个列表列的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆