Short text clustering with large dataset - user profiling


Problem description


Let me explain what I want to do:

Input


A csv file with millions of rows, each containing the id of a user and a string with the list of keywords used by that user, separated by spaces. The format of the second field (the string) is not so important and I can change it based on my needs, for example by adding the counts of those keywords. The data comes from the Twitter database: users are Twitter users and keywords are "meaningful" words taken from their tweets (how is not important).

Example row


This is currently what a single row of the csv looks like:
(user id, keywords)

"1627498372", " play house business card"  

Goal


Given that input, I want to cluster users in Java based on the keywords they use, so that the different clusters somehow represent users with similar interests (and therefore similar keyword usage), without using machine learning techniques, natural language processing, or parallelization techniques like MapReduce. I have searched the internet for many clustering algorithm libraries, like BIRCH, BFR, CURE, ROCK, CLARANS, etc., but none of them seems to suit my needs, because they are either meant for spatial points, or they use machine learning models, or they struggle with large datasets.


So I am here to ask whether you know of such clustering algorithm names/libraries/reasonably implementable pseudocode (preferably jars) for texts, or ones that can easily be modified to work with strings.

I hope everything is clear.

Update


While I was waiting for responses I came upon the scikit-learn library for Python, especially MiniBatchKMeans, and I am trying something with it for now... so just as an update, if you find something in Python, feel free to share.
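To make that update concrete, here is a minimal sketch of the scikit-learn route, pairing HashingVectorizer (so no explicit vocabulary has to be kept in memory) with MiniBatchKMeans; the file name, number of clusters, batch size and feature count are placeholder assumptions to tune:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# Hash the keyword strings into a fixed-size sparse feature space,
# so millions of users fit without building an explicit vocabulary.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Cluster count and batch size are assumed values, not recommendations.
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000, random_state=0)

user_ids, docs = [], []
with open("users.csv", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        uid, _, keywords = line.partition(",")
        user_ids.append(uid.strip().strip('"'))
        docs.append(keywords.strip().strip('"'))

X = vectorizer.transform(docs)   # sparse matrix, one row per user
labels = kmeans.fit_predict(X)   # cluster id for each user, aligned with user_ids

If the whole matrix does not fit in memory, MiniBatchKMeans also exposes partial_fit, so the same loop can feed the model chunk by chunk instead of materializing all documents at once.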

Answer


Instead of clustering (how many clusters? What about users that do not fit any cluster?), you should rather consider frequent itemset mining to find popular combinations of keywords.
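As an illustration of that suggestion, here is a minimal sketch in Python that counts frequent keyword pairs (2-itemsets) with plain collections.Counter; the file name and support threshold are assumptions, and a full Apriori/FP-growth implementation would generalize this to larger itemsets:

from collections import Counter
from itertools import combinations

pair_counts = Counter()

# Count how often each unordered keyword pair co-occurs for the same user.
with open("users.csv", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        _, _, keywords = line.partition(",")
        kws = sorted(set(keywords.strip().strip('"').split()))
        pair_counts.update(combinations(kws, 2))

min_support = 1000  # assumed absolute support threshold
frequent_pairs = sorted(
    ((pair, n) for pair, n in pair_counts.items() if n >= min_support),
    key=lambda item: -item[1],
)
print(frequent_pairs[:20])  # the most popular keyword combinations

Unlike clustering, this does not force every user into a group, which is the point of the rhetorical questions in the answer above.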

