Python - Speed up converting a categorical variable to its numerical index

Views: 924  Published: 2017/3/26 1:58:41  Tags: python performance numpy pandas dataframe

Question

I need to convert a column of categorical values in a pandas DataFrame into numerical values, where each value is the index of that category in the array of unique categories in the column (long story!). Here is a code snippet that accomplishes that:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

which converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

as desired. My problem is that this dumb little for loop (the only way I've thought of to do this) is slow as molasses when I run similar code on big data files. Does anyone know of a more efficient way to do this? Thanks in advance for any thoughts.

Recommended answer

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1
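
As a minimal sketch of what factorize returns: it gives back both the integer codes and the array of unique labels, so the codes can be used as indices into that array. By default the labels are numbered in order of first appearance; the `sort=True` argument below is an assumption that you want the same sorted order as `np.unique` in the question. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(["baked", "beans", "baked", "baked", "beans"])

# factorize returns the integer codes plus the unique labels
codes, uniques = pd.factorize(s, sort=True)

# with sort=True the labels come back in sorted order, like np.unique
sorted_like_numpy = list(uniques) == list(np.unique(s.to_numpy()))

# indexing the unique labels with the codes rebuilds the original column
restored = uniques.take(codes)

# the question wanted float values, so cast the codes if that matters
codes_as_float = codes.astype(float)
```

If the float dtype from the question is not actually required, the integer codes can be assigned to the column directly, as in the answer above.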

See the pandas documentation for factorize.

EDIT:

As Jeff mentioned in a comment, the best approach is to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
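
A small sketch of how the categorical column relates to the numeric index asked for in the question, assuming the goal is still to get at the integer positions: the `.cat.codes` accessor holds the index of each row's label within `.cat.categories`. The column length of 5000 rows is an arbitrary choice for illustration, and the exact memory figures will vary by pandas version.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

# memory of the plain string column, counting the string data itself
mem_strings = df["col"].memory_usage(deep=True)

cat = df["col"].astype("category")
mem_category = cat.memory_usage(deep=True)

# .cat.codes is the integer index of each row's label in .cat.categories
codes = cat.cat.codes
categories = cat.cat.categories

print(mem_strings, mem_category)
```

The categorical column keeps the original labels readable while storing only small integers per row, which is where the memory saving comes from.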

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k:

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop

len(df)=5k:

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Code for testing:

import numpy as np
import pandas as pd

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
# 5 rows * 100000 = 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)
df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

# pandas factorize, result indexed inline
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

# pandas factorize, result unpacked first
def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

# numpy unique with return_inverse, result indexed inline
def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

# numpy unique with return_inverse, result unpacked first
def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx
    return df

print (a(df1))
print (a1(df2))
print (b(df3))
print (b1(df4))
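
The timings above were taken with IPython's %timeit. As a rough sketch for reproducing the comparison in a plain Python script, the standard-library timeit module can be used instead. The function names, the 5000-row column, and the repeat counts below are illustrative assumptions, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

col = pd.Series(["baked", "beans", "baked", "baked", "beans"] * 1000)

def with_factorize(s):
    # codes numbered in order of first appearance
    return pd.factorize(s)[0]

def with_unique(s):
    # codes numbered by sorted label order
    return np.unique(s.to_numpy(), return_inverse=True)[1]

# check that both approaches agree before comparing their timings
# (they do here because "baked" appears first and also sorts first)
same = np.array_equal(with_factorize(col), with_unique(col))

# take the best of 3 runs, each run calling the function 10 times
t_factorize = min(timeit.repeat(lambda: with_factorize(col), number=10, repeat=3))
t_unique = min(timeit.repeat(lambda: with_unique(col), number=10, repeat=3))
print(f"factorize: {t_factorize:.4f}s  np.unique: {t_unique:.4f}s")
```

Timing the bare functions rather than the DataFrame assignment avoids the side effect in the test code above, where the column is already numeric after the first call.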
