如何有效地重新编码和计数 [英] How to recode and count efficiently
问题描述
我有一个很大的csv,这种形式每行三个字符串:
I have a large csv with three strings per row in this form:
a,c,d
c,a,e
f,g,f
a,c,b
c,a,d
b,f,s
c,a,c
我阅读了前两列中的字符串,将其重新编码为整数,然后删除重复项,计数每行的副本数,如下所示:
I read in the first two columns recode the strings to integers and then remove duplicates counting how many copies of each row there were as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
letters = set(df.values.flat)
df.replace(to_replace=letters, value=range(len(letters)), inplace=True)
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
print df1
这给出了:
ID_0 ID_1 count
0 0 1 2
1 1 0 3
2 2 4 1
3 4 3 1
这正是我所需要的.
尽管我的数据很大,但我还是要进行两项改进.
However as my data is large I would like to make two improvements.
- 我该如何进行groupby,然后重新编码,而不是相反?问题是我不能执行
df1[['ID_0','ID_0']].replace(to_replace=letters, value=range(len(letters)), inplace = True)
.这给出了错误
正在尝试从DataFrame的切片副本上设置一个值"
"A value is trying to be set on a copy of a slice from a DataFrame"
推荐答案
我喜欢使用 sklearn.preprocessing.LabelEncoder
进行字母到数字的转换:
I like to use sklearn.preprocessing.LabelEncoder
to do the letter to digit conversion:
from sklearn.preprocessing import LabelEncoder
# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
结果输出:
ID_0 ID_1 count
0 0 2 2
1 1 3 1
2 2 0 3
3 3 4 1
如果要在以后转换回字母,可以使用le.inverse_transform
:
If you want to convert back to letters at a later point in time, you can use le.inverse_transform
:
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.inverse_transform)
哪个会按预期映射回去:
Which maps back as expected:
ID_0 ID_1 count
0 a c 2
1 b f 1
2 c a 3
3 f g 1
如果只想知道哪个数字对应哪个字母,可以查看le.classes_
属性.这将为您提供一个字母数组,该字母数组由它编码为的数字索引:
If you just want to know which digit corresponds to which letter, you can look at the le.classes_
attribute. This will give you an array of letters, which is indexed by the digit it encodes to:
le.classes_
['a' 'b' 'c' 'f' 'g']
要获得更直观的表示,可以将其投射为系列:
For a more visual representation, you can cast as a Series:
pd.Series(le.classes_)
0 a
1 b
2 c
3 f
4 g
时间
使用更大版本的示例数据和以下设置:
Using a larger version of the sample data and the following setup:
df2 = pd.concat([df]*10**5, ignore_index=True)
def root(df):
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
return df
def pir2(df):
unq = np.unique(df)
mapping = pd.Series(np.arange(unq.size), unq)
return df.stack().map(mapping).unstack() \
.groupby(df.columns.tolist()).size().reset_index(name='count')
我得到以下计时:
%timeit root(df2)
10 loops, best of 3: 101 ms per loop
%timeit pir2(df2)
1 loops, best of 3: 1.69 s per loop
这篇关于如何有效地重新编码和计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!