pandas: faster unrolling of a series of lists for one-hot encoding?

Question

I'm reading from a database that has many array-type columns. pd.read_sql gives me a dataframe whose columns are dtype=object and contain lists.

I'd like an efficient way to find which rows have arrays containing a given element:

import pandas as pd

s = pd.Series(
    [[1,2,3], [1,2], [99], None, [88,2]]
)
print(s)

..

0    [1, 2, 3]
1       [1, 2]
2         [99]
3         None
4      [88, 2]

I'm building one-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:

   contains_1  contains_2  contains_3  contains_88
0  1          ...
1  1
2  0
3  nan
4  0
...

I can unroll a series of arrays like so:

s2 = s.apply(pd.Series).stack()

0  0     1.0
   1     2.0
   2     3.0
1  0     1.0
   1     2.0
2  0    99.0
4  0    88.0
   1     2.0

which lets me find the elements meeting some test:

>>> print(s2[(s2==2)].index.get_level_values(0))
Int64Index([0, 1, 4], dtype='int64')

Great! This step:

s.apply(pd.Series).stack()

produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many tens of seconds for a single column with 500k rows holding lists of tens of items), and I have many columns.
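The per-category iteration described above could be sketched as follows. This is only an illustration of the slow baseline, not code from the original post: the helper name contains and the choice of NaN for rows without a list are assumptions, and .dropna() is added after .stack() so the result does not depend on how a given pandas version treats missing values when stacking.

```python
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

# Slow baseline from the question: one row per (original row, position).
s2 = s.apply(pd.Series).stack().dropna()


def contains(value):
    """Hypothetical helper: 1.0 where the row's list holds value, NaN where there is no list."""
    rows = s2[s2 == value].index.get_level_values(0)
    col = pd.Series(0.0, index=s.index)
    col.loc[s.isna()] = np.nan   # rows that had no list at all
    col.loc[rows] = 1.0          # rows whose list contains the value
    return col


features = pd.DataFrame({f"contains_{v}": contains(v) for v in (1, 2, 3, 88)})
print(features)
```

Each category costs one vectorized comparison over s2, so the expensive part remains building s2 itself.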

Update: It seems likely that having the data in a series of lists to begin with is what makes this slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull array data into a better structure?

Recommended answer

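import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()                  # skip rows that hold no list
v = s0.values.tolist()           # plain Python list of lists
i = s0.index.values              # original row labels
l = [len(x) for x in v]          # length of each list
c = list(cytoolz.concat(v))      # flatten; materialized so len(c) works
# start offset of each list in the flattened data, repeated once per element
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)
k = np.arange(len(c)) - n        # position of each element within its own list

# two-level index: (original row label, position within the list)
s1 = pd.Series(c, [i.repeat(l), k])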
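To get from the unrolled series to the contains_* table the question asks for, one possible sketch follows. It is not part of the original answer: it swaps cytoolz.concat for the standard library's itertools.chain.from_iterable so that it runs without extra dependencies, builds the index with pd.MultiIndex.from_arrays, and assumes pd.get_dummies plus a group-wise max is an acceptable way to collapse the elements back to one row per original row.

```python
from itertools import chain

import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

# Same unrolling idea as above, using only the standard library to flatten.
s0 = s.dropna()
v = s0.tolist()
lens = np.array([len(x) for x in v])
flat = list(chain.from_iterable(v))
starts = np.append(0, lens[:-1]).cumsum().repeat(lens)
pos = np.arange(len(flat)) - starts
index = pd.MultiIndex.from_arrays([s0.index.values.repeat(lens), pos])
s1 = pd.Series(flat, index=index)

# One indicator column per distinct element, collapsed to one row per
# original row; rows that had no list come back as NaN after reindexing.
one_hot = (
    pd.get_dummies(s1, dtype=float)
    .groupby(level=0)
    .max()
    .reindex(s.index)
    .add_prefix("contains_")
)
print(one_hot)
```

Reindexing against s.index is what restores row 3 (the None entry) as a row of NaN, matching the table in the question.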

UPDATE: What worked for me...

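def unroll(s):
    # drop rows that hold no list
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v)) # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v]) #s.apply(len) is slower
    # start offset of each list in the flattened data, repeated per element
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    # position of each element within its own list
    k = np.arange(sum(lens)) - n

    s = pd.Series(c)
    s.index = [i.repeat(lens), k]

    # drop elements that are themselves missing
    s = s.dropna()
    return s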

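It should be possible to replace:

    s = pd.Series(c)
    s.index = [i.repeat(lens), k]

with:

    s = pd.Series(c, index=[i.repeat(lens), k])

But this doesn't work. (It is said to be OK here.) A likely cause: c is already a pd.Series, so the index= argument reindexes it against its existing labels instead of relabeling it.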
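On pandas 0.25 or later, Series.explode may be a simpler route to the same structure. It is not part of the original answer, and no timing comparison is claimed here; the sketch below only shows that it yields the same (row, position) layout, with the position level rebuilt via groupby(...).cumcount().

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

# explode() emits one row per list element and repeats the original row label.
flat = s.dropna().explode()

# Rebuild the within-list position as a second index level.
pos = flat.groupby(level=0).cumcount().to_numpy()
flat.index = pd.MultiIndex.from_arrays([flat.index.to_numpy(), pos])
print(flat)
```

The exploded series has dtype=object, so a cast such as flat.astype("int64") may be needed before numeric work.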
