Iterate over pandas dataframe columns containing nested arrays
Question
I hope you can help me with this issue.
I have the data below (the column names don't matter):
data=([['file0090',
        ([[ 84,  55, 189],
          [248, 100,  18],
          [ 68, 115,  88]])],
       ['file6565',
        ([[ 86,  58, 189],
          [ 24,  10, 118],
          [ 68,  11,   8]])
       ]])
I need to iterate over columns 0 and 1 and build a list that I can turn into a DataFrame with this output:
col0      col1  col2  col3
file0090    84    55   189
file0090   248   100    18
file0090    68   115    88
file6565    86    58   189
file6565    24    10   118
file6565    68    11     8
I've tried every DataFrame iteration approach (iterrows, iteritems, items) and appending to a list, but the results always come out the same, and I don't understand how to separate the items from these arrays.
Thank you in advance if you can help.
Recommended answer
You can create a custom function that outputs the data in the right form.
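from itertools import chain
import pandas as pd

def transform(d):
    for l in d:
        # x collects the leading fields (here the file name); y is the nested list of rows
        *x, y = l
        # prepend the file name to every row
        yield list(map(lambda s: x+s, y))

df = pd.DataFrame(chain(*transform(data)))
df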
          0    1    2    3
0  file0090   84   55  189
1  file0090  248  100   18
2  file0090   68  115   88
3  file6565   86   58  189
4  file6565   24   10  118
5  file6565   68   11    8
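The question asks for the columns to be named col0 to col3, while the output above has the default integer labels. A minimal sketch of one way to get the names, assuming the nested arrays are plain Python lists as in the sample data (the `prefix` and `rows` names are illustrative, not from the original answer):

```python
from itertools import chain

import pandas as pd

data = [
    ["file0090", [[84, 55, 189], [248, 100, 18], [68, 115, 88]]],
    ["file6565", [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]

def transform(d):
    # Everything before the last field is the prefix (the file name);
    # the last field is the nested list of rows.
    for *prefix, rows in d:
        yield [prefix + row for row in rows]

# Flatten the per-file row lists into one list of rows, then name the columns.
flat_rows = list(chain.from_iterable(transform(data)))
df = pd.DataFrame(flat_rows, columns=["col0", "col1", "col2", "col3"])
print(df)
```

If the nested values are NumPy arrays rather than lists, `prefix + row` would need `list(row)` instead, since `+` between a list and an array is not concatenation.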
Timeit results of all the solutions:
# YOBEN_S's answer
In [275]: %%timeit
...: s = pd.DataFrame(data).set_index(0)[1].explode()
...: df = pd.DataFrame(s.tolist(), index = s.index.values)
...:
...:
1.52 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Anky's answer
In [276]: %%timeit
...: df = pd.DataFrame(data).add_prefix('col')
...: out = df.explode('col1').reset_index(drop=True)
...: out = out.join(pd.DataFrame(out.pop('col1').tolist()).add_prefix('col_'))
...:
...:
3.71 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Dhaval's answer
In [277]: %%timeit
...: data_f = []
...: for i in data:
...:     for j in i[1]:
...:         data_f.append([i[0]]+j)
...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
...:
...:
712 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#My answer
In [280]: %%timeit
...: pd.DataFrame(chain(*transform(data)))
...:
...:
489 µs ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Using List comp of Dhaval's answer
In [306]: %%timeit
...: data_f = [[i[0]]+j for i in data for j in i[1]]
...: df = pd.DataFrame(data_f, columns =['col0','col1','col2','col3'])
...:
...:
586 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Anky's 2nd solution
In [308]: %%timeit
...: l = [*chain.from_iterable(data)]
...: pd.DataFrame(np.vstack(l[1::2]),index = np.repeat(l[::2],len(l[1])))
...:
...:
221 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
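One caveat about the last (fastest) variant: `np.repeat(l[::2], len(l[1]))` repeats every file name by the row count of the first nested array, so it appears to rely on all arrays having the same number of rows. A hedged sketch of a variant that uses each array's own length, with a hypothetical ragged input that is not part of the original question:

```python
import numpy as np
import pandas as pd

# Hypothetical ragged input: the first file has 2 rows, the second has 3.
ragged = [
    ["file0090", [[84, 55, 189], [248, 100, 18]]],
    ["file6565", [[86, 58, 189], [24, 10, 118], [68, 11, 8]]],
]

names = [name for name, _ in ragged]
arrays = [np.asarray(rows) for _, rows in ragged]

# Repeat each name once per row of its own array,
# rather than once per row of the first array.
index = np.repeat(names, [len(a) for a in arrays])
out = pd.DataFrame(np.vstack(arrays), index=index)
print(out)
```

This has not been timed, so it may or may not keep the speed advantage shown above.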