如何有效地重新编码和计数 [英] How to recode and count efficiently

查看:86
本文介绍了如何有效地重新编码和计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个很大的csv,这种形式每行三个字符串:

I have a large csv with three strings per row in this form:

a,c,d
c,a,e
f,g,f
a,c,b
c,a,d
b,f,s
c,a,c

我阅读了前两列中的字符串,将其重新编码为整数,然后删除重复项,计数每行的副本数,如下所示:

I read in the first two columns recode the strings to integers and then remove duplicates counting how many copies of each row there were as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
letters = set(df.values.flat)
df.replace(to_replace=letters, value=range(len(letters)), inplace=True)
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
print df1

这给出了:

   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1

这正是我所需要的.

尽管我的数据很大,但我还是要进行两项改进.

However as my data is large I would like to make two improvements.

  • 我该如何进行groupby,然后重新编码,而不是相反?问题是我不能执行df1[['ID_0','ID_0']].replace(to_replace=letters, value=range(len(letters)), inplace = True).这给出了错误

正在尝试从DataFrame的切片副本上设置一个值"

"A value is trying to be set on a copy of a slice from a DataFrame"

  • 如何避免创建df1?那就是整个事情就地完成.
  • 推荐答案

    我喜欢使用 sklearn.preprocessing.LabelEncoder 进行字母到数字的转换:

    I like to use sklearn.preprocessing.LabelEncoder to do the letter to digit conversion:

    from sklearn.preprocessing import LabelEncoder
    
    # Perform the groupby (before converting letters to digits).
    df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
    
    # Initialize the LabelEncoder.
    le = LabelEncoder()
    le.fit(df[['ID_0', 'ID_1']].values.flat)
    
    # Convert to digits.
    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
    

    结果输出:

       ID_0  ID_1  count
    0     0     2      2
    1     1     3      1
    2     2     0      3
    3     3     4      1
    

    如果要在以后转换回字母,可以使用le.inverse_transform:

    If you want to convert back to letters at a later point in time, you can use le.inverse_transform:

    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.inverse_transform)
    

    哪个会按预期映射回去:

    Which maps back as expected:

      ID_0 ID_1  count
    0    a    c      2
    1    b    f      1
    2    c    a      3
    3    f    g      1
    

    如果只想知道哪个数字对应哪个字母,可以查看le.classes_属性.这将为您提供一个字母数组,该字母数组由它编码为的数字索引:

    If you just want to know which digit corresponds to which letter, you can look at the le.classes_ attribute. This will give you an array of letters, which is indexed by the digit it encodes to:

    le.classes_ 
    
    ['a' 'b' 'c' 'f' 'g']
    

    要获得更直观的表示,可以将其投射为系列:

    For a more visual representation, you can cast as a Series:

    pd.Series(le.classes_)
    
    0    a
    1    b
    2    c
    3    f
    4    g
    

    时间

    使用更大版本的示例数据和以下设置:

    Using a larger version of the sample data and the following setup:

    df2 = pd.concat([df]*10**5, ignore_index=True)
    
    def root(df):
        df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
        le = LabelEncoder()
        le.fit(df[['ID_0', 'ID_1']].values.flat)
        df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
        return df
    
    def pir2(df):
        unq = np.unique(df)
        mapping = pd.Series(np.arange(unq.size), unq)
        return df.stack().map(mapping).unstack() \
          .groupby(df.columns.tolist()).size().reset_index(name='count')
    

    我得到以下计时:

    %timeit root(df2)
    10 loops, best of 3: 101 ms per loop
    
    %timeit pir2(df2)
    1 loops, best of 3: 1.69 s per loop
    

    这篇关于如何有效地重新编码和计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆