How to vectorize JSON data for KMeans?
Question
I have a number of questions and choices which users are going to answer. They have a format like this:
question_id, text, choices
For each user, I store the answered questions and the selected choices as JSON in MongoDB:
{"user_id": "", "question_answers": [{"question_id": "choice_id", ..}]}
Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices of questions, but I need to convert my user data to numeric vectors like the example in Spark's docs.
K-Means data sample and my desired output:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
I've already tried using scikit-learn's DictVectorizer but it doesn't seem to be working fine.
I created a key for each question_choice combination like this:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)
And I try to transform each of my user's question/choice pairs into this:
v.transform({'question_1_choice_2': 1, ...})
I get a result like this:
[[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
Is this the right approach? Because I need to create a dict of all my choices and answers every time. Is there a way to do this in Spark?
Thanks in advance. Sorry I'm new to data science.
Recommended answer
Don't use K-Means with categorical data. Let me quote "How to understand the drawbacks of K-means" by KevinKim:
k-means assumes the variance of the distribution of each attribute (variable) is spherical;
all variables have the same variance;
the prior probability for all k clusters is the same, i.e. each cluster has a roughly equal number of observations.
If any one of these 3 assumptions is violated, then k-means will fail.
With encoded categorical data, the first two assumptions are almost certain to be violated.
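To make the variance point concrete, here is a minimal sketch using only the Python standard library. The answers are made-up data for a single hypothetical question. A 0/1 indicator column whose mean is p has variance p * (1 - p), so columns for popular and rare choices cannot all share the same variance.

```python
from statistics import pvariance

# Hypothetical answers to one question by eight users (made-up data).
answers = ["a", "a", "a", "a", "a", "a", "b", "c"]

# One-hot encode: one 0/1 column per distinct choice.
choices = sorted(set(answers))
columns = {c: [1 if a == c else 0 for a in answers] for c in choices}

# Population variance of each indicator column equals p * (1 - p),
# where p is the share of users who picked that choice.
variances = {c: pvariance(col) for c, col in columns.items()}

for choice, var in variances.items():
    print(choice, var)
# a 0.1875
# b 0.109375
# c 0.109375
```

The columns for one question also always sum to 1 per user, so they are correlated with each other, which is at odds with the spherical assumption as well.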
For further discussion see K-means clustering is not a free lunch by David Robinson.
I'm trying to use K-Means clustering and streaming to find most similar users based on their choices of questions
For similarity searches, use MinHashLSH with approximate joins.
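The idea behind MinHash can be sketched in plain Python. This is not Spark's implementation or API, only an illustration under some assumptions: each user is treated as a set of "question=choice" tokens, the document layout follows the JSON shape from the question, and the helper names (tokens, jaccard, make_hashes, signature, estimate) and the sample users are hypothetical. The fraction of matching MinHash signature positions estimates the Jaccard similarity of two sets.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the hash family

def tokens(doc):
    """Flatten one stored user document into a set of 'question=choice' tokens."""
    return {f"{q}={c}" for qa in doc["question_answers"] for q, c in qa.items()}

def jaccard(a, b):
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def make_hashes(n, seed=0):
    """Draw n random (a, b) pairs defining hash functions x -> (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def signature(items, hashes):
    """MinHash signature: per hash function, the minimum hash over the set.

    Assumes a non-empty set. crc32 gives a stable integer per token."""
    base = [zlib.crc32(t.encode("utf-8")) for t in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in hashes]

def estimate(sig_a, sig_b):
    """Estimated Jaccard similarity: share of positions where signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Made-up users in the shape described in the question.
u1 = {"user_id": "u1", "question_answers": [{"q1": "c1", "q2": "c2", "q3": "c1"}]}
u2 = {"user_id": "u2", "question_answers": [{"q1": "c1", "q2": "c2", "q3": "c3"}]}

hashes = make_hashes(256)
s1, s2 = signature(tokens(u1), hashes), signature(tokens(u2), hashes)

exact = jaccard(tokens(u1), tokens(u2))  # 2 shared tokens out of 4 distinct
approx = estimate(s1, s2)                # should land near the exact value
print(exact, approx)
```

In Spark, MinHashLSH plays this role on sparse binary feature vectors, and the fitted model's approxSimilarityJoin pairs up rows whose estimated Jaccard distance falls below a threshold, without comparing every pair of users.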
You'll have to StringIndex and OneHotEncode all variables for that, as shown in the following answers:
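As a rough illustration of what indexing followed by one-hot encoding produces, here is a plain-Python sketch. It is not the Spark ML API: the functions fit_columns and to_vector and the sample users are hypothetical, columns are assigned in sorted order (Spark's StringIndexer orders labels by frequency by default), and no category is dropped. The column mapping is built once and then reused for every user, so there is no need to rebuild a dictionary of all choices each time.

```python
def fit_columns(docs):
    """Assign one fixed column index to every (question, choice) pair seen."""
    seen = {}
    for doc in docs:
        for qa in doc["question_answers"]:
            for q, c in qa.items():
                seen.setdefault(q, set()).add(c)
    columns = {}
    for q in sorted(seen):
        for c in sorted(seen[q]):
            columns[(q, c)] = len(columns)
    return columns

def to_vector(doc, columns):
    """Binary vector with 1.0 at the column of each answered (question, choice)."""
    vec = [0.0] * len(columns)
    for qa in doc["question_answers"]:
        for q, c in qa.items():
            idx = columns.get((q, c))
            if idx is not None:  # pairs not seen during fitting are ignored
                vec[idx] = 1.0
    return vec

# Made-up users in the shape described in the question.
users = [
    {"user_id": "u1", "question_answers": [{"q1": "c1", "q2": "c2"}]},
    {"user_id": "u2", "question_answers": [{"q1": "c2", "q2": "c2"}]},
]

columns = fit_columns(users)
vectors = [to_vector(u, columns) for u in users]
print(columns)  # {('q1', 'c1'): 0, ('q1', 'c2'): 1, ('q2', 'c2'): 2}
print(vectors)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
```

In Spark, the corresponding pipeline stages would be StringIndexer and OneHotEncoder per question column, with the encoded columns then combined into a single feature vector.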
See also the comment by henrikstroem.