How can I vectorize Tweets using Spark's MLLib?
Problem description
I'd like to turn tweets into vectors for machine learning, so that I can categorize them based on content using Spark's K-Means clustering. For example, all tweets relating to Amazon would be put into one category.
I have tried splitting the tweet into words and creating a vector using HashingTF, but that wasn't very successful.
Are there any other ways to vectorize tweets?
Recommended answer
You can try the following pipeline:
First, tokenize the input Tweet (located in the column text). Basically, this creates a new column rawWords containing the list of words extracted from the original text. To get these words, it matches alphanumeric tokens rather than splitting on delimiters (.setPattern("\\w+").setGaps(false)).
import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("rawWords")
  .setPattern("\\w+")
  .setGaps(false)
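Outside of Spark, the effect of this tokenizer can be illustrated with a small plain-Python sketch (the regex comes from the snippet above; the sample tweet is made up):

```python
import re

def tokenize(text):
    # Mimic RegexTokenizer with pattern "\w+" and gaps=false: the
    # pattern matches the tokens themselves (runs of word characters)
    # rather than the delimiters between them. Tokens are lower-cased,
    # which RegexTokenizer also does by default.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(tokenize("Amazon's new Kindle is 20% off! #deal"))
# ['amazon', 's', 'new', 'kindle', 'is', '20', 'off', 'deal']
```

Note that punctuation such as the apostrophe and the hash sign simply disappears, which is usually what you want for clustering.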
Secondly, you may consider removing stop words, i.e. dropping the less significant words in the text, such as a, the, of, etc.
import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")
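The removal itself is just filtering; as a plain-Python sketch (the tiny stop-word set here is purely illustrative, Spark's default English list is much longer):

```python
# Illustrative stop-word set; StopWordsRemover ships a much
# larger default English list.
STOP_WORDS = {"a", "the", "of", "is", "s"}

def remove_stop_words(raw_words):
    # Keep only the words that are not in the stop-word set.
    return [w for w in raw_words if w not in STOP_WORDS]

print(remove_stop_words(["amazon", "s", "new", "kindle", "is", "20", "off"]))
# ['amazon', 'new', 'kindle', '20', 'off']
```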
Now it's time to vectorize the words column. In this example I'm using CountVectorizer, which is quite basic. There are many others, such as a TF-IDF vectorizer. You can find more information here.
I've configured the CountVectorizer so that it creates a vocabulary of 10,000 words, with each word appearing a minimum of 5 times across all documents and a minimum of 1 time in each document.
import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5.0)
  .setMinTF(1.0)
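To make the three settings concrete, here is a plain-Python sketch of what CountVectorizer computes (the toy corpus, the smaller thresholds, and the alphabetical tie-breaking are assumptions for illustration, not Spark's exact ordering):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    # Keep words that occur in at least min_df documents (minDF),
    # capped at vocab_size entries (vocabSize). Ties between equally
    # frequent words are broken alphabetically here just to keep the
    # example deterministic.
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency, not term frequency
    kept = sorted((w for w, c in df.items() if c >= min_df),
                  key=lambda w: (-df[w], w))[:vocab_size]
    return {w: i for i, w in enumerate(kept)}

def count_vector(doc, vocab, min_tf=1):
    # Count vocabulary words in one document; counts below min_tf
    # are zeroed out (minTF=1.0 keeps every occurrence).
    tf = Counter(w for w in doc if w in vocab)
    vec = [0] * len(vocab)
    for w, c in tf.items():
        if c >= min_tf:
            vec[vocab[w]] = c
    return vec

docs = [["amazon", "kindle"], ["amazon", "prime"], ["kindle", "sale"]]
vocab = fit_vocabulary(docs, vocab_size=10, min_df=2)
print(sorted(vocab))                                      # ['amazon', 'kindle']
print(count_vector(["amazon", "amazon", "echo"], vocab))  # [2, 0]
```

Words below the minDF threshold (prime, sale) never make it into the vocabulary, and out-of-vocabulary words at transform time (echo) are simply ignored.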
Finally, just create the pipeline, then fit it and transform your data with the model it generates by passing in the datasets.
import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
  .setStages(Array(
    tokenizer,
    stopWordsRemover,
    countVectorizer))

transformPipeline.fit(training).transform(test)
Hope it helps.