Feature Hashing of zip codes with Scikit in machine learning

Views: 67 | Published: 2021/5/6 20:31:53 | Tags: machine-learning scikit-learn hash feature-extraction


Problem description

I am working on a machine learning problem in which my data set contains many zip codes (~8k unique values). I therefore decided to hash the values into a smaller feature space instead of using something like one-hot encoding (OHE).

The problem I ran into is that only a small percentage (about 20%) of the rows in my hashed array are unique, which, as I understand it, means I have many duplicates/collisions. Even after increasing the number of features in the hasher to ~200, I never got more than about 20% unique rows. This does not make sense to me: with a growing number of columns in the hashed output, more unique combinations should be possible.

I used the following code to hash my zip codes with scikit-learn and to estimate the collisions from the number of unique rows in the resulting array:

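import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# `Daten` is my DataFrame (not shown); the "PLZ" column holds the zip codes as strings.
D = pd.unique(Daten["PLZ"])

print("Zipcode Data:", D, "\nZipcode Shape:", D.shape)

# Hash into 2**5 = 32 columns.
h = FeatureHasher(n_features=2**5, input_type="string")
f = h.transform(D)
f = f.toarray()

print("Feature Array:\n", f, "\nFeature Shape:", f.shape)

# Collapse identical rows to count how many distinct hashed rows remain.
unq = np.unique(f, axis=0)

print("Unique values:\n", unq, "\nUnique Shape:", unq.shape)
print("Percentage of unique values in hash array:", unq.shape[0] / f.shape[0] * 100)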

The output I received:

Zipcode Data: ['86916' '01445' '37671' ... '82387' '83565' '83550'] 
Zipcode Shape: (8158,)
Feature Array:
 [[ 2.  1.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 ...
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 [ 0. -1.  0. ...  0.  0.  0.]] 
Feature Shape: (8158, 32)
Unique values:
 [[ 0. -3.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 [ 0. -2.  0. ...  0.  0.  0.]
 ...
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]
 [ 4.  0.  0. ...  0.  0.  0.]] 
Unique Shape: (1707, 32)
Percentage of unique values in hash array: 20.9242461387595
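
One possible reading of this output, offered as an interpretation rather than a confirmed diagnosis: with input_type="string", FeatureHasher expects each sample to be an iterable of strings, so a bare string such as '86916' would be consumed one character at a time. Entries like 2, -3 and 4 in the arrays above are consistent with per-digit counts. If that is what happens, a row only records which digits occur and how often, not their order, so the number of distinct rows is capped by the number of digit multisets of length five, whatever n_features is. The plain-Python sketch below checks that cap; the helper name digit_multiset and the assumption that every zip code is exactly five digits are illustrative, not taken from the question.

```python
from itertools import product
from math import comb

# Assumption: every zip code is a 5-character string of digits, as in the output above.
ZIP_LENGTH = 5
DIGITS = "0123456789"


def digit_multiset(zip_code: str) -> tuple:
    """Order-insensitive signature: all that survives when each character is hashed on its own."""
    return tuple(sorted(zip_code))


# Enumerate all 10**5 possible codes and count the distinct signatures.
signatures = {digit_multiset("".join(p)) for p in product(DIGITS, repeat=ZIP_LENGTH)}
max_unique_rows = len(signatures)

# Closed form: multisets of size 5 drawn from 10 symbols.
assert max_unique_rows == comb(len(DIGITS) + ZIP_LENGTH - 1, ZIP_LENGTH)

n_zip_codes = 8158  # number of unique zip codes reported in the output above
print(max_unique_rows)                                # 2002
print(round(100 * max_unique_rows / n_zip_codes, 1))  # 24.5
```

Under these assumptions the share of unique rows cannot exceed roughly 24.5%, which would be in line with the observed 1707 unique rows (about 20.9%) and with the figure not improving as columns are added. Hash collisions between digits can only lower the count further.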

Any help and insights are greatly appreciated.
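
A minimal sketch of hashing each zip code as a single token, assuming the intent is to hash whole zip codes rather than their individual characters. With input_type="string", FeatureHasher takes an iterable of iterables of strings, so each zip code is wrapped in its own one-element list here. The sample list zip_codes is hypothetical; its values are copied from the output shown above.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical sample; the values are copied from the output shown above.
zip_codes = ["86916", "01445", "37671", "82387", "83565"]

hasher = FeatureHasher(n_features=2**5, input_type="string")

# Wrap each zip code in a list so the whole string is hashed as one token.
tokens = [[z] for z in zip_codes]
f = hasher.transform(tokens).toarray()

# Each row now holds exactly one non-zero entry (+1 or -1).
nonzero_per_row = np.count_nonzero(f, axis=1)
print(f.shape)          # (5, 32)
print(nonzero_per_row)  # [1 1 1 1 1]
```

With one token per row, a row can take at most 2 * n_features distinct values (one column, one sign), so the share of unique rows is not a good measure of hashing quality here: at 32 columns, at most 64 distinct rows are possible for 8158 zip codes. Keeping most zip codes distinguishable would need n_features in the thousands, which is a trade-off to weigh against one-hot encoding.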
