What is feature hashing (hashing-trick)?
I know feature hashing (hashing-trick) is used to reduce the dimensionality and handle the sparsity of bit vectors, but I don't understand how it really works. Can anyone explain this to me? Is there any Python library available to do feature hashing?
Thank you.
In Pandas, you could use something like this:
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data = pd.DataFrame(data)
def hash_col(df, col, N):
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        tmp = [0] * N
        tmp[hash(x) % N] = 1       # set the bucket this value hashes into
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))
The output would be something like (which buckets get set depends on the hash values):
pop year state_0 state_1 state_2 state_3
0 1.5 2000 0 1 0 0
1 1.7 2001 0 1 0 0
2 3.6 2002 0 1 0 0
3 2.4 2001 0 0 0 1
4 2.9 2002 0 0 0 1
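The same idea extends beyond a single column: hash "name=value" strings so that all categorical features share one fixed-width vector. A minimal plain-Python sketch (the function and variable names here are illustrative, not from any library):

```python
def hash_features(row, n_buckets):
    """Hash a dict of features into one fixed-width vector.

    Collisions (two distinct values sharing a bucket) are the
    price paid for a fixed, small n_buckets.
    """
    vec = [0] * n_buckets
    for name, value in row.items():
        key = "%s=%s" % (name, value)       # e.g. "state=Ohio"
        vec[hash(key) % n_buckets] += 1     # += so colliding features accumulate
    return vec

row = {"state": "Ohio", "year": 2000}
v = hash_features(row, 8)
print(v)
```

Note that the vector length is 8 regardless of how many distinct values ever appear, which is exactly the dimensionality reduction the trick buys.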
Also, at the Series level, you could do:

import numpy as np
import pandas as pd

def hash_col(df, col, N):
    df = df.replace('', np.nan)
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0] * N
    tmp[hash(df[col]) % N] = 1          # .ix is gone from modern pandas; plain label access works on a Series
    res = pd.concat([df, pd.Series(tmp, index=cols)])   # Series.append was removed in pandas 2.0
    return res.drop(col)

a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])
print(hash_col(a, 'city', 10))
print(hash_col(b, 'city', 10))
This works on a single Series at a time; the "column" name is assumed to be a label in the Series index. It also replaces blank strings with NaN (note the object dtype in the output below).
age 30
test NaN
city_0 0
city_1 0
city_2 0
city_3 0
city_4 0
city_5 0
city_6 0
city_7 1
city_8 0
city_9 0
dtype: object
age 30
test NaN
city_0 0
city_1 0
city_2 0
city_3 0
city_4 0
city_5 1
city_6 0
city_7 0
city_8 0
city_9 0
dtype: object
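One caveat about the snippets above: in Python 3, hash() on strings is salted per process (controlled by PYTHONHASHSEED), so the same value can land in a different bucket on the next run. For reproducible buckets you can substitute a stable digest; a sketch using md5 (an arbitrary but deterministic choice), usable as a drop-in replacement for hash(x) % N:

```python
import hashlib

def stable_bucket(value, n_buckets):
    """Deterministic bucket index for a value, stable across processes."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    # take 8 bytes of the digest as an integer, then reduce to a bucket
    return int.from_bytes(digest[:8], "big") % n_buckets

print(stable_bucket("new york", 10), stable_bucket("boston", 10))
```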
If, however, there is a vocabulary, and you simply want to one-hot-encode, you could use
import pandas as pd
import scipy.sparse as sps

def hash_col(df, col, vocab):
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x):
        tmp = [0] * len(vocab)
        tmp[vocab.index(x)] = 1    # position comes from the vocabulary, not from a hash
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df2 = hash_col(df, 'state', ['Ohio','Nevada'])
print(df2)
print(sps.csr_matrix(df2))   # sparse version

which will give, for the dense frame,
pop year state=Ohio state=Nevada
0 1.5 2000 1 0
1 1.7 2001 1 0
2 3.6 2002 1 0
3 2.4 2001 0 1
4 2.9 2002 0 1
I also added sparsification of the final dataframe. The approach above can be used in an incremental setting, where we might not have encountered all values beforehand but have obtained the list of all possible values some other way. Incremental ML methods need the same number of features at each increment, hence one-hot encoding must produce the same number of columns for every batch.
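The hashing trick sidesteps the vocabulary requirement entirely: the output width is fixed at N by construction, so a value never seen before still maps into the same N columns. A small sketch of that property across batches (the batch values here are made up for illustration):

```python
def hashed_row(value, n_buckets):
    # width is n_buckets no matter what values show up later
    vec = [0] * n_buckets
    vec[hash(value) % n_buckets] = 1
    return vec

batch1 = ["Ohio", "Nevada"]
batch2 = ["Texas"]           # unseen value: no vocabulary update needed
rows = [hashed_row(v, 4) for v in batch1 + batch2]
widths = {len(r) for r in rows}
print(widths)                # every row has the same width
```

This is why the hashing trick is popular in streaming and online-learning setups, at the cost of occasional collisions.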