How can I cluster text data with multiple columns?

Question

I'd like to do k-means clustering on book text data that has 'title', 'genre', 'review', and 'synopsis' columns.

I want to use the 'title' as the indicator, or primary key, for clustering, but I'm not sure how to use multiple columns for this.

I know that I first have to vectorize the data, but a vectorizer takes a Series as input, not a whole DataFrame, so here again I don't know how to use all the columns the way I want to.

Recommended answer

You can vectorize each column separately and concatenate the results.

Just make sure you do a sparse concatenation.
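One way this could look, as a minimal sketch rather than a definitive recipe: it assumes pandas, SciPy, and scikit-learn, uses `TfidfVectorizer` as the vectorizer (the answer does not name one), and runs on a made-up four-row table. The `books` data, the choice of which columns to vectorize, and `n_clusters=2` are all illustrative assumptions.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for the real book table (invented for illustration).
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["science fiction", "science fiction", "romance", "romance"],
    "review": [
        "gripping space adventure with clever technology",
        "dark futuristic story about machines and hackers",
        "charming love story with witty dialogue",
        "tender tale of love and second chances",
    ],
    "synopsis": [
        "a crew travels to a distant planet",
        "a hacker is hired to break into a network",
        "a young woman plays matchmaker in her village",
        "two former lovers meet again after many years",
    ],
})

# Columns to turn into features; 'title' is kept out and used only as a label.
text_columns = ["genre", "review", "synopsis"]

# Fit one vectorizer per column. Each call takes a single Series and
# returns a sparse matrix of shape (n_books, n_terms_in_that_column).
blocks = [TfidfVectorizer().fit_transform(books[col]) for col in text_columns]

# Sparse concatenation: place the blocks side by side without densifying.
features = hstack(blocks).tocsr()

# Assumption: two clusters, chosen only because the toy data has two genres.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Index the cluster labels by title so each book can be looked up by name.
clusters = pd.Series(labels, index=books["title"], name="cluster")
print(clusters)
```

Here the title never enters the feature matrix; it only identifies which row received which cluster label, which is one reading of using it as a "primary key".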

However, clustering text with k-means does not work well at all. K-means is very sensitive to outliers and noise, and text is full of noise. The fundamental assumptions of k-means (k signals plus i.i.d. Gaussian error) do not hold for text. Good luck...
