Feature hashing of zip codes with scikit-learn in machine learning
Question
I am working on a machine learning problem where my data set contains a lot of zip codes (~8k unique values). I therefore decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).
The problem I ran into is that only a small share (about 20%) of the rows in my hashed array are unique, which, as far as I understand, means I have a lot of duplicates/collisions. Even when I increased the number of features in the hasher to ~200, I never got more than 20% unique rows. This does not make sense to me: with a growing number of columns in the hash, more unique combinations should be possible.
I used the following code to hash my zip codes with scikit-learn and to estimate the collisions from the unique rows of the resulting array:
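import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Daten is the DataFrame from the question; its "PLZ" column holds the zip codes as strings
D = pd.unique(Daten["PLZ"])
print("Zipcode Data:", D, "\nZipcode Shape:", D.shape)

# Hash the zip codes into 2**5 = 32 columns
h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()
print("Feature Array:\n", f, "\nFeature Shape:", f.shape)

# Count the distinct rows to estimate how many zip codes collide
unq = np.unique(f, axis=0)
print("Unique values:\n", unq, "\nUnique Shape:", unq.shape)
print("Percentage of unique values in hash array:", unq.shape[0] / f.shape[0] * 100)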
The output I received:
Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550']
Zipcode Shape: (8158,)
Feature Array:
[[ 2. 1. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
...
[ 0. 0. 0. ... 0. 0. 0.]
[ 1. 0. 0. ... 0. 0. 0.]
[ 0. -1. 0. ... 0. 0. 0.]]
Feature Shape: (8158, 32)
Unique values:
[[ 0. -3. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
[ 0. -2. 0. ... 0. 0. 0.]
...
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]
[ 4. 0. 0. ... 0. 0. 0.]]
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595
Any help and insights are greatly appreciated.
Recommended answer
That very first 2 in the transformed data should be a clue. I think you'll also find that many of the columns are all-zero.
From the documentation:
Each sample must be iterable...
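That requirement matters here because a Python string is itself an iterable of its characters. A quick check in plain Python (no scikit-learn involved; the variable names are only illustrative) shows what a consumer sees when it iterates over a single zip code:

```python
from collections import Counter

zip_code = "86916"

# Iterating over a string yields its individual characters
tokens = list(zip_code)
print(tokens)

# Count how often each character occurs in this one zip code
counts = Counter(tokens)
print(counts["6"])
```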
So the hasher is treating the zip code '86916' as the collection of elements 8, 6, 9, 1, 6, and you get at most ten nonzero columns, one per digit (the first column presumably being the 6, which appears twice, hence the 2 noted at the beginning). You should be able to rectify this by reshaping the input to be 2-dimensional, so that each sample is a collection containing the whole zip code as a single element.
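A minimal sketch of that fix follows. It assumes a small hypothetical array `zips` standing in for `pd.unique(Daten["PLZ"])`, and it wraps each zip code in a one-element list rather than calling `reshape(-1, 1)`; the intent is the same, namely that every sample becomes a collection holding the full code as one token.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical stand-in for pd.unique(Daten["PLZ"]); values copied from the printed output
zips = np.array(["86916", "01445", "37671", "82387", "83565", "83550"])

hasher = FeatureHasher(n_features=2**5, input_type="string")

# Wrap every zip code in its own list so each sample is a
# one-element collection and the whole code is hashed as a single token
samples = [[z] for z in zips.tolist()]
hashed = hasher.transform(samples).toarray()

# Each row should now contain exactly one nonzero entry (+1 or -1)
nonzero_per_row = np.count_nonzero(hashed, axis=1)
print(hashed.shape)
print(nonzero_per_row)
```

Note that with one token per row, a row is fully determined by its column and sign, so there can be at most 2 * n_features distinct rows. With 32 columns and ~8k zip codes, many collisions are therefore still expected; a much larger n_features would be needed to reduce them.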