Feature Hashing of zip codes with Scikit in machine learning

Question

I am working on a machine learning problem where my data set contains a lot of zip codes (~8k unique values). I therefore decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).

The problem I encountered is that only a small percentage (about 20%) of the rows in my hashed array are unique, which, as I understand it, means I have a lot of duplicates/collisions. Even when I increased the number of features in the hasher to ~200, I never got more than 20% unique rows. This does not make sense to me, since more columns in the hash should allow more unique combinations.

I used the following code to hash my zip codes with scikit and calculate the collisions based on the unique rows in the resulting array:

import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Unique zip codes from the "PLZ" column of the data set
D = pd.unique(Daten["PLZ"])

print("Zipcode Data:", D,"\nZipcode Shape:", D.shape)

h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()

print("Feature Array:\n",f ,"\nFeature Shape:", f.shape)

unq = np.unique(f, axis=0)

print("Unique values:\n",unq,"\nUnique Shape:",unq.shape)
print("Percentage of unique values in hash array:",unq.shape[0]/f.shape[0]*100)

The output I received:

Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550'] 
Zipcode Shape: (8158,)
Feature Array:
 [[ 2.  1.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 ...
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]] 
Feature Shape: (8158, 32)
Unique values:
 [[ 0. -3.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 ...
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]] 
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595

Any help and insights are greatly appreciated.

Recommended answer

That very first 2 in the transformed data should be a clue. I think you'll also find that many of the columns are all-zero.

From the documentation:

Each sample must be iterable...
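
A bare Python string is itself iterable, character by character. The plain-Python sketch below (no scikit-learn involved) illustrates what that means for a zip code passed as a sample; the variable names are only for illustration.

```python
from collections import Counter

zip_code = "86916"

# Iterating over a string yields its individual characters,
# so each digit becomes a separate token.
tokens = list(zip_code)
print(tokens)

# The digit '6' occurs twice, so a count-based encoding would
# assign it a magnitude of 2.
counts = Counter(tokens)
print(counts)
```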

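So the hasher is treating the zip code '86916' as the collection of elements 8, 6, 9, 1, 6. Because only the ten digits 0-9 can occur as elements, you get at most ten nonzero columns no matter how large n_features is (the first column presumably being the 6, which appears twice, as noted at the beginning). You should be able to rectify this by reshaping the input to be 2-dimensional, so that each sample is a one-element collection containing the whole zip code.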
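
A minimal sketch of that fix, assuming a small hand-written list of zip codes in place of `pd.unique(Daten["PLZ"])` from the question (the `zips`, `samples`, and `hashed` names are illustrative). Each zip code is wrapped in its own list so the hasher sees the whole string as a single token:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical stand-in for the unique zip codes in the question.
zips = ["86916", "01445", "37671", "82387", "83565", "83550"]

hasher = FeatureHasher(n_features=2**5, input_type="string")

# One-element list per sample: the zip code is hashed as a whole,
# not digit by digit.
samples = [[z] for z in zips]
hashed = hasher.transform(samples).toarray()

# With a single token per sample, every row has exactly one
# nonzero entry, and that entry is +1 or -1.
nonzero_per_row = np.count_nonzero(hashed, axis=1)
print(hashed.shape)
print(nonzero_per_row)
```

Note that with one token per sample, a row is fully determined by one column index and one sign, so there can be at most 2 * n_features distinct rows (64 for n_features=2**5). With roughly 8k zip codes, collisions therefore remain unavoidable at this size, and the share of unique rows is not expected to approach 100%.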
