How to vectorize JSON data for KMeans?

Views: 167  Posted: 2020/4/26 10:23:26  Tags: apache-spark scikit-learn pyspark k-means


Problem description

I have a number of questions and choices that users are going to answer. They are stored in this format:

question_id, text, choices

For each user, I store the answered questions and the selected choices as a JSON document in MongoDB:

{"user_id": "", "question_answers": [{"question_id": "choice_id", ..}]}

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on the choices they made, but I need to convert my user data into numeric vectors like the example in Spark's docs.

K-Means data sample and my desired output:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
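
To make the target concrete, the conversion could be sketched in plain Python as below, assuming one 0/1 column per question/choice pair. The sample documents, the IDs, and the helper names (answered_pairs, to_row) are made up for illustration and only follow the document shape shown above.

```python
# Hypothetical sample documents in the shape described above.
users = [
    {"user_id": "u1", "question_answers": [{"q1": "c1", "q2": "c3"}]},
    {"user_id": "u2", "question_answers": [{"q1": "c2", "q2": "c3"}]},
]


def answered_pairs(doc):
    """Return the set of (question_id, choice_id) pairs in one user document."""
    pairs = set()
    for answers in doc["question_answers"]:
        pairs.update(answers.items())
    return pairs


# One column per distinct question/choice pair, in a fixed (sorted) order.
columns = sorted(set().union(*(answered_pairs(d) for d in users)))


def to_row(doc):
    """Encode one user as a list of 0.0/1.0 values, one per column."""
    chosen = answered_pairs(doc)
    return [1.0 if col in chosen else 0.0 for col in columns]


rows = [to_row(d) for d in users]
for row in rows:
    # Same space-separated layout as the sample above.
    print(" ".join(str(x) for x in row))
# prints:
# 1.0 0.0 1.0
# 0.0 1.0 1.0
```

The column order is fixed once from all known pairs, so every user ends up with a vector of the same length.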

I've already tried scikit-learn's DictVectorizer, but it doesn't seem to work well.

I created a key for each question/choice combination like this:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)

Then I try to transform each user's question/choice pairs like this:

v.transform({'question_1_choice_2': 1, ...})

I get a result like this:

[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
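
For reference, a fitted DictVectorizer keeps its feature-to-column mapping, so the full dict only has to be built once at fit time; later calls to transform reuse that mapping and silently ignore keys that were not seen during fitting. A minimal sketch of the whole flow with scikit-learn's KMeans follows. The toy answers and n_clusters=2 are assumptions made for illustration, not values from my real data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# Hypothetical answers: one feature dict per user, keyed as above.
D = [
    {"question_1_choice_1": 1, "question_2_choice_1": 1},
    {"question_1_choice_1": 1, "question_2_choice_2": 1},
    {"question_1_choice_2": 1, "question_2_choice_3": 1},
    {"question_1_choice_2": 1, "question_2_choice_3": 1},
]

v = DictVectorizer(sparse=False)
X = v.fit_transform(D)  # learns the feature-to-column mapping once

# n_clusters=2 is an arbitrary choice for this toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new user is encoded with the already-fitted vectorizer,
# so the vector has the same columns as the training matrix.
new_user = {"question_1_choice_2": 1, "question_2_choice_3": 1}
x_new = v.transform([new_user])
label = km.predict(x_new)[0]  # cluster index assigned to the new user
```

Users that share a label in km.labels_ are the ones K-Means grouped together, which is one way to read "most similar" here.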

Is this the right approach? It means I need to build a dict of all my choices and answers every time. Is there a way to do this in Spark?

Thanks in advance. Sorry, I'm new to data science.
