Python - Speed up converting a categorical variable to its numerical index

Problem description

I need to convert a column of categorical variables in a Pandas data frame into numerical values, where each value is the index of that category in an array of the column's unique categories (long story!). Here's a code snippet that accomplishes that:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

which converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

as desired. My problem is that my dumb little for loop (the only way I've thought of to do this) is slow as molasses when I run similar code on big data files. Does anyone know of a more efficient way to do this? Thanks in advance for any thoughts.

Recommended answer

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1

Docs: pandas.factorize
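One detail worth spelling out, as a minimal sketch rather than part of the original answer: by default `pd.factorize` numbers the categories in order of first appearance, while the question's loop indexes into the sorted array from `np.unique`. Passing `sort=True` should give codes that match the sorted order, and `astype(float)` reproduces the float output shown in the question. The column name `col_idx` and the reordered sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data starts with "beans" so the two orderings actually differ.
df = pd.DataFrame({'col': ["beans", "baked", "beans", "baked", "baked"]})

# Default: codes follow order of first appearance ("beans" gets 0).
codes, uniques = pd.factorize(df['col'])

# sort=True: codes index into the sorted uniques, like np.unique does.
sorted_codes, sorted_uniques = pd.factorize(df['col'], sort=True)
np_uniques, np_codes = np.unique(df['col'].to_numpy(), return_inverse=True)

# Float codes, matching the dtype of the output the question asked for.
# 'col_idx' is a hypothetical column name used only in this sketch.
df['col_idx'] = sorted_codes.astype(float)
print(df)
```

The second return value (`uniques` or `sorted_uniques`) is the lookup array, so `sorted_uniques[sorted_codes]` recovers the original labels.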

EDIT:

As Jeff mentioned in a comment, the best option is to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
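If the numeric index is still needed after the conversion, a categorical column already carries it. The following is a small sketch, assuming the same sample data repeated to 5k rows; the variable names are illustrative. It reads the integer codes through the `.cat` accessor and compares memory use with `memory_usage(deep=True)`.

```python
import pandas as pd

base = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

cat_col = df['col'].astype("category")

# Integer code per row; each code indexes into cat_col.cat.categories.
codes = cat_col.cat.codes
categories = cat_col.cat.categories

# Compare the memory footprint of the string column and the categorical one.
str_bytes = df['col'].memory_usage(deep=True)
cat_bytes = cat_col.memory_usage(deep=True)
print(str_bytes, cat_bytes)
```

The saving comes from storing one small integer per row plus a single copy of each distinct label, so it grows with the number of rows relative to the number of categories.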

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k:

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop  

len(df)=5k:

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Code for testing:

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)


df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))  
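One caveat about the functions above, offered as an observation rather than a correction of the reported numbers: each one overwrites `df['col']`, so every `%timeit` loop after the first operates on integer codes instead of the original strings. A hedged sketch of a variant that leaves the frame untouched, timed with the standard `timeit` module so it also runs outside IPython, could look like this. The function names `factorize_codes` and `unique_codes` are hypothetical, and the repeat counts are arbitrary.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

def factorize_codes(frame):
    # Returns the codes without writing them back into the frame.
    return pd.factorize(frame['col'])[0]

def unique_codes(frame):
    # Same idea with numpy; the inverse array holds the codes.
    return np.unique(frame['col'].to_numpy(), return_inverse=True)[1]

runs = 20
# Best of 3 repeats, averaged over `runs` calls each.
t_pd = min(timeit.repeat(lambda: factorize_codes(df), number=runs, repeat=3)) / runs
t_np = min(timeit.repeat(lambda: unique_codes(df), number=runs, repeat=3)) / runs
print(f"factorize: {t_pd * 1e3:.3f} ms, np.unique: {t_np * 1e3:.3f} ms")
```

Timings depend on the machine and library versions, so no expected figures are given here.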
