Iterate over pandas dataframe columns containing nested arrays


Question

I hope you can help me with this issue.

I have the data below (the column names don't matter):

data = [
    ['file0090',
     [[ 84,  55, 189],
      [248, 100,  18],
      [ 68, 115,  88]]],
    ['file6565',
     [[ 86,  58, 189],
      [ 24,  10, 118],
      [ 68,  11,   8]]],
]

I need to iterate over columns 0 and 1 and build a list that I can turn into a DataFrame with this output:

col0          col1   col2   col3
file0090      84     55     189
file0090      248    100    18
file0090      68     115    88
file6565      86     58     189
file6565      24     10     118
file6565      68     11     8

I've tried iterating over the dataframe with iterrows, iteritems, and items, appending to a list each time, but the results always come out the same, and I don't see how to separate the items from these arrays.

Thank you in advance if you can help.

Recommended answer

You can write a custom generator function that reshapes the data into one flat row per array row.

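import pandas as pd
from itertools import chain

def transform(d):
    for l in d:
        # Split each entry into its leading label(s) and the trailing nested array
        *x, y = l
        # Prepend the label(s) to every row of the nested array
        yield list(map(lambda s: x+s, y))

# Flatten the per-file row lists into a single sequence of rows
df = pd.DataFrame(chain(*transform(data)))
df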
          0    1    2    3
0  file0090   84   55  189
1  file0090  248  100   18
2  file0090   68  115   88
3  file6565   86   58  189
4  file6565   24   10  118
5  file6565   68   11    8
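
The frame above has integer column labels, whereas the question asked for col0 through col3. A minimal sketch of one way to get them, assuming the same sample data and the same transform logic as above, is to pass the names through the columns argument of pd.DataFrame. The variable `names` is only illustrative and is not part of the original answer.

```python
from itertools import chain

import pandas as pd

# Sample data from the question
data = [
    ['file0090', [[84, 55, 189], [248, 100, 18], [68, 115, 88]]],
    ['file6565', [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]

def transform(d):
    for entry in d:
        # Leading label(s) go into x, the nested array into y
        *x, y = entry
        # One flat row per array row, with the label(s) in front
        yield [x + row for row in y]

# Illustrative column names taken from the desired output in the question
names = ['col0', 'col1', 'col2', 'col3']
rows = list(chain.from_iterable(transform(data)))
df = pd.DataFrame(rows, columns=names)
print(df)
```

chain.from_iterable is used here in place of chain(*...); both flatten the per-file lists in the same way.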

Timeit results of all the solutions:

# YOBEN_S's answer
In [275]: %%timeit
     ...: s = pd.DataFrame(data).set_index(0)[1].explode()
     ...: df = pd.DataFrame(s.tolist(), index = s.index.values)
     ...:
     ...:
1.52 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Anky's answer
In [276]: %%timeit
     ...: df = pd.DataFrame(data).add_prefix('col')
     ...: out = df.explode('col1').reset_index(drop=True)
     ...: out = out.join(pd.DataFrame(out.pop('col1').tolist()).add_prefix('col_'))
     ...:
     ...:
3.71 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Dhaval's answer
In [277]: %%timeit
     ...: data_f = []
     ...: for i in data:
     ...:     for j in i[1]:
     ...:         data_f.append([i[0]]+j)
     ...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
     ...:
     ...:
712 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#My answer
In [280]: %%timeit
     ...: pd.DataFrame(chain(*transform(data)))
     ...:
     ...:
489 µs ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Using List comp of Dhaval's answer

In [306]: %%timeit
     ...: data_f = [[i[0]]+j for i in data for j in i[1]]
     ...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
     ...:
     ...:
586 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Anky's 2nd solution

In [308]: %%timeit
     ...: l = [*chain.from_iterable(data)]
     ...: pd.DataFrame(np.vstack(l[1::2]),index = np.repeat(l[::2],len(l[1])))
     ...:
     ...:
221 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
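
Before reading too much into these timings, it may be worth checking that the variants return the same rows. The rough sketch below, assuming the sample data from the question, compares three of them. Note that the explode and vstack variants put the file names in the index rather than in column 0, and that the vstack variant as written relies on every file having the same number of rows, since it repeats each name len(l[1]) times.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample data from the question
data = [
    ['file0090', [[84, 55, 189], [248, 100, 18], [68, 115, 88]]],
    ['file6565', [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]

# List-comprehension variant: file name ends up in column 0
rows_a = [[i[0]] + j for i in data for j in i[1]]
df_a = pd.DataFrame(rows_a)

# vstack variant: file names become the index
# (assumes every nested array has the same number of rows)
l = [*chain.from_iterable(data)]
df_b = pd.DataFrame(np.vstack(l[1::2]), index=np.repeat(l[::2], len(l[1])))

# explode variant: file names become the index
s = pd.DataFrame(data).set_index(0)[1].explode()
df_c = pd.DataFrame(s.tolist(), index=s.index.values)

# Compare the numeric part and the file-name labels separately
values_a = df_a.iloc[:, 1:].to_numpy().tolist()
same_values = values_a == df_b.to_numpy().tolist() == df_c.to_numpy().tolist()
same_labels = df_a[0].tolist() == list(df_b.index) == list(df_c.index)
print(same_values, same_labels)
```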
