What is feature hashing (hashing-trick)?


Problem Description

I know feature hashing (the hashing trick) is used to reduce dimensionality and handle the sparsity of bit vectors, but I don't understand how it really works. Can anyone explain this to me? Is there any Python library available to do feature hashing?

Thank you.

Solution

With pandas, you could use something like this:

import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

data = pd.DataFrame(data)

def hash_col(df, col, N):
    # one indicator column per hash bucket
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        tmp = [0] * N
        tmp[hash(x) % N] = 1   # the hashing trick: bucket = hash value mod N
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))

The output would be

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        1        0        0
1  1.7  2001        0        1        0        0
2  3.6  2002        0        1        0        0
3  2.4  2001        0        0        0        1
4  2.9  2002        0        0        0        1
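One caveat: in Python 3, `hash()` on strings is salted per process (controlled by `PYTHONHASHSEED`), so the bucket a value lands in can change between runs. If you need reproducible buckets, you could swap in a stable digest; a minimal sketch using `hashlib` (the function name `stable_hash` is just for illustration):

```python
import hashlib

def stable_hash(s, N):
    """Map a string to a bucket in [0, N) deterministically across runs."""
    # md5 of the UTF-8 bytes, reduced modulo the number of buckets
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % N

# drop-in replacement for hash(x) % N inside xform:
# tmp[stable_hash(x, N)] = 1
```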

Also, at the Series level, you could do:

import numpy as np
import pandas as pd

def hash_col(df, col, N):
    df = df.replace('', np.nan)
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0] * N
    tmp[hash(df[col]) % N] = 1               # hash the value stored at label `col`
    res = pd.concat([df, pd.Series(tmp, index=cols)])  # append the indicator entries
    return res.drop(col)

a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])

print(hash_col(a, 'city', 10))
print(hash_col(b, 'city', 10))

This version works on a single Series; the column name is assumed to be an index label of the Series. It also replaces blank strings with NaN. Note that the result has dtype `object`, since the values are mixed.

age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      0
city_6      0
city_7      1
city_8      0
city_9      0
dtype: object
age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      1
city_6      0
city_7      0
city_8      0
city_9      0
dtype: object

If, however, there is a fixed vocabulary and you simply want to one-hot encode, you could use:

import numpy as np
import pandas as pd
import scipy.sparse as sps

def hash_col(df, col, vocab):
    # one indicator column per vocabulary entry
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x):
        tmp = [0] * len(vocab)
        tmp[vocab.index(x)] = 1   # position of the value in the vocabulary
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df = pd.DataFrame(data)

df2 = hash_col(df, 'state', ['Ohio', 'Nevada'])
print(df2)

# sparse representation of the encoded frame
print(sps.csr_matrix(df2))

which will give

   pop  year  state=Ohio  state=Nevada
0  1.5  2000           1             0
1  1.7  2001           1             0
2  3.6  2002           1             0
3  2.4  2001           0             1
4  2.9  2002           0             1
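For the fixed-vocabulary case, pandas' built-in `pd.get_dummies` produces the same kind of one-hot columns without a custom function. One difference to be aware of: it derives the columns from the values actually present in the data, so it only matches the explicit-vocabulary approach when every vocabulary value occurs in the batch. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

# one-hot encode 'state'; prefix_sep='=' mimics the column names above
df2 = pd.get_dummies(df, columns=['state'], prefix_sep='=')
print(df2.columns.tolist())  # ['year', 'pop', 'state=Nevada', 'state=Ohio']
```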

I also added a sparse representation of the final dataframe. In an incremental setting, where we might not have encountered all values beforehand (but somehow obtained the list of all possible values), the approach above can be used. Incremental ML methods need the same number of features at each increment, hence the one-hot encoding must produce the same number of columns in each batch.
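As for a ready-made library: scikit-learn ships `sklearn.feature_extraction.FeatureHasher` (and `HashingVectorizer` for text), which implement the hashing trick as a sparse transform with signed hashing to reduce collision bias. A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction import FeatureHasher

# hash string features into a fixed-width sparse matrix;
# n_features is the number of buckets, analogous to N above
h = FeatureHasher(n_features=8, input_type='string')
X = h.transform([['Ohio'], ['Ohio'], ['Nevada']])  # one list of tokens per row

print(X.shape)      # (3, 8)
print(X.toarray())  # each row has a single +/-1 in its hashed bucket
```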
