How to speed LabelEncoder up recoding a categorical variable into integers


Question

I have a large csv with two strings per row in this form:

g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h

I read in the first two columns and recode the strings to integers as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)

# Convert to digits.
df = df.apply(le.transform)

This code comes from https://stackoverflow.com/a/39419342/2179021.

The code works very well but is slow when df is large. I timed each step and the result was surprising to me.

  • pd.read_csv takes about 40 seconds.
  • le.fit(df.values.flat) takes about 30 seconds
  • df = df.apply(le.transform) takes about 250 seconds.

Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
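
As an aside (a sketch, not part of the original question), one variation is to call transform once on the flattened values and reshape, rather than once per column; whether this helps will depend on the scikit-learn version:

import pandas as pd

# Encode every value in a single call, then restore the two-column layout.
codes = le.transform(df.values.ravel()).reshape(df.shape)
df_encoded = pd.DataFrame(codes, columns=df.columns, index=df.index)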

More timing of the recoding steps on a computer with 4 GB of RAM

The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:

   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4

Notice that 'd' is mapped to 3 in the first column but 2 in the second.
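
One way to keep the speed of the category approach while getting a consistent mapping (a sketch, not part of the original question) is to build a single categorical dtype from the values of both columns and apply it to each column:

import pandas as pd

# One shared set of categories, built from every value in both columns,
# so that the same string maps to the same integer in each column.
all_values = pd.unique(df.values.ravel())
shared_dtype = pd.CategoricalDtype(categories=sorted(all_values))

codes = df.apply(lambda col: col.astype(shared_dtype).cat.codes)

The sorted() call only mimics LabelEncoder's alphabetical code assignment; it can be dropped if the actual integer values do not matter.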

I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop

Then I increased the dataframe size by a factor of 10.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str) 
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError                               Traceback (most recent call last)

This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
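
A lower-memory variant of the same stacking idea (a sketch, not from the original question) avoids the second df.stack() call and the rename_categories step by reading the category codes directly:

# Stack the two columns into one Series, convert to a categorical once,
# read off the integer codes, and unstack to restore the original shape.
x = df.stack().astype('category').cat.codes.unstack()

Because both columns share one set of categories here, 'd' also gets the same code in each column.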

I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds and the df.apply(le.transform) step took about 80 seconds.

How can I:

  1. Get something roughly as fast as maxymoo's answer below and roughly as memory-efficient as LabelEncoder, but that gives the right answer when the dataframe has two columns?
  2. Store the mapping so that I can reuse it for different data (in the way LabelEncoder allows)?

Answer

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:

In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 
                            'ID_1':np.random.randint(0,1000,1000000)}).astype(str)

In [88]: le.fit(df.values.flat) 
         %time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s

In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms

EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # Remember the distinct labels seen during fitting.
        self.classes_ = pd.unique(y)
        return self

    def transform(self, y):
        # Recode using the categories learned in fit(); labels not seen
        # during fit() become -1.
        dtype = pd.CategoricalDtype(categories=self.classes_)
        return pd.Series(y).astype(dtype).cat.codes
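
A short usage sketch (not part of the original answer): fitting once on the flattened values of both columns gives one consistent mapping, and the fitted encoder can then be reused on other data, which covers both requirements from the question.

ple = PandasLabelEncoder()
ple.fit(df.values.ravel())          # one mapping shared by both columns

encoded = df.apply(ple.transform)   # 'd' now maps to the same code in each column

# The fitted classes_ are stored on the encoder, so it can later be applied
# to a different dataframe with the same kind of string data.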
