How can I cluster text data with multiple columns?
Question
I'd like to do k-means clustering on book text data that has 'title', 'genre', 'review', and 'synopsis' columns.
I want to use the 'title' as the indicator, or primary key, for clustering, but I'm not sure how to use multiple columns for this.
I know that I first have to vectorize the data, but the vectorizer takes a Series as input, not a whole DataFrame, so here again I don't know how to use all the columns the way I want to.
Recommended answer
You can vectorize each column separately and concatenate the results.
Just make sure you do a sparse concatenation, so the combined feature matrix is never converted to a dense array.
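As a rough sketch of what that could look like, assuming pandas, SciPy, and scikit-learn are available: the toy `books` DataFrame, the choice of `TfidfVectorizer`, and `n_clusters=2` below are illustrative assumptions, not something given in the question. The 'title' column is kept out of the features and only used to label the resulting clusters.

```python
import pandas as pd
from scipy.sparse import hstack, issparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real book table.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["fantasy adventure", "fantasy epic", "crime thriller", "crime noir"],
    "review": [
        "magical world with dragons",
        "wizards and dragons everywhere",
        "gritty detective story",
        "dark murder investigation",
    ],
    "synopsis": [
        "a young hero finds a dragon",
        "a wizard defends the kingdom",
        "a detective hunts a killer",
        "a murder shakes the city",
    ],
})

# 'title' is the identifier, so it is not vectorized.
text_columns = ["genre", "review", "synopsis"]

# Fit one vectorizer per column; each call takes a Series and
# returns a sparse matrix with one row per book.
blocks = [TfidfVectorizer().fit_transform(books[col]) for col in text_columns]

# Sparse concatenation: stack the per-column matrices side by side
# without ever building a dense array.
features = hstack(blocks).tocsr()

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Map each cluster label back to its title.
clusters = pd.DataFrame({"title": books["title"], "cluster": labels})
print(clusters)
```

Because every block has one row per book in the same order, row i of `features` still corresponds to row i of `books`, which is what lets the title act as the key for the cluster labels.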
However, clustering text with k-means does not work well at all. K-means is very sensitive to outliers and noise, and text is full of noise. The fundamental assumptions of k-means (k signals and i.i.d. Gaussian error) do not hold for text. Good luck...