How to vectorize json data for KMeans?


Problem Description

I have a number of questions and choices which users are going to answer. They have the format like this:

question_id, text, choices

And for each user I store the answered questions and the selected choices as a JSON document in MongoDB:

{user_id: "",  "question_answers" : [{"question_id": "choice_id", ..}] }

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices of questions, but I need to convert my user data to numeric vectors like the example in Spark's docs here.

KMeans data sample and my desired output:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
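
For reference, this is the kind of input the linked Spark docs feed into KMeans: each line is parsed into a numeric vector and the model is trained on the resulting RDD. A minimal sketch along those lines (assuming an existing SparkContext `sc`; the file path is a placeholder):

from numpy import array
from pyspark.mllib.clustering import KMeans

# Parse whitespace-separated rows like the sample above into dense vectors.
data = sc.textFile("data/mllib/kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(" ")]))

# Train a 2-cluster model, as in the Spark MLlib clustering example.
clusters = KMeans.train(parsed, 2, maxIterations=10, initializationMode="random")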

I've already tried using scikit-learn's DictVectorizer, but it doesn't seem to work well.

I created a key for each question_choice combination like this:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)

And I try to transform each of my user's question/choice pairs into this:

v.transform({'question_1_choice_2': 1, ...})

And I get a result like this:

[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
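
For context, a minimal self-contained sketch of what I'm doing, fitting the vectorizer once over all users so the vocabulary of question_choice keys is built a single time (the `user_docs` data below is a made-up stand-in for my MongoDB documents):

from sklearn.feature_extraction import DictVectorizer

# Hypothetical documents shaped like the MongoDB records above.
user_docs = [
    {"user_id": "u1", "question_answers": [{"q1": "c1"}, {"q2": "c3"}]},
    {"user_id": "u2", "question_answers": [{"q1": "c1"}, {"q2": "c2"}]},
]

def to_features(doc):
    # One "question_choice" key per answered question, with value 1.
    return {"%s_%s" % (q, c): 1
            for answer in doc["question_answers"]
            for q, c in answer.items()}

v = DictVectorizer(sparse=False)
X = v.fit_transform([to_features(d) for d in user_docs])
print(v.feature_names_)  # e.g. ['q1_c1', 'q2_c2', 'q2_c3']
print(X)                 # one row per user, one column per question_choice key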

Is this the right approach? It means I need to build a dict of all my choices and answers every time. Is there a way to do this in Spark?

Thanks in advance. Sorry I'm new to data science.

Answer

Don't use K-Means with categorical data. Let me quote How to understand the drawbacks of K-means by KevinKim:

  • k-means assumes the variance of the distribution of each attribute (variable) is spherical;

  • all variables have the same variance;

  • the prior probability for all k clusters is the same, i.e. each cluster has a roughly equal number of observations.

If any one of these 3 assumptions is violated, then k-means will fail.

With encoded categorical data, the first two assumptions are almost certain to be violated.

For further discussion see K-means clustering is not a free lunch by David Robinson.

I'm trying to use K-Means clustering and streaming to find most similar users based on their choices of questions

For similarity searches use MinHashLSH with approximate joins:

You'll have to StringIndex and OneHotEncode all variables for that, as shown in the following answers:

Set up a dataframe for randomForest - pyspark

See also the comment by henrikstroem (https://stackoverflow.com/users/790474/henrikstroem).
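
Putting that together, a rough sketch of such a pipeline in PySpark (Spark 3.x API; the toy dataframe, column names, and the 0.6 distance threshold are illustrative assumptions, not part of the original answer):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinHashLSH

spark = SparkSession.builder.getOrCreate()

# One row per user, one column per question holding the selected choice_id.
df = spark.createDataFrame(
    [("u1", "c1", "c3"), ("u2", "c1", "c2"), ("u3", "c2", "c3")],
    ["user_id", "q1", "q2"])

question_cols = ["q1", "q2"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in question_cols]
encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in question_cols],
    outputCols=[c + "_vec" for c in question_cols],
    dropLast=False)
assembler = VectorAssembler(
    inputCols=[c + "_vec" for c in question_cols], outputCol="features")
lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)

model = Pipeline(stages=indexers + [encoder, assembler, lsh]).fit(df)
transformed = model.transform(df)

# Approximate self-join: pairs of users whose binary vectors are within
# the given Jaccard distance of each other.
similar = model.stages[-1].approxSimilarityJoin(
    transformed, transformed, 0.6, distCol="jaccard")
similar.select("datasetA.user_id", "datasetB.user_id", "jaccard").show()

The one-hot vectors are binary, which is what MinHashLSH expects, and approxSimilarityJoin returns candidate pairs of similar users without comparing every pair exhaustively.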

