How can I vectorize Tweets using Spark's MLLib?
Problem description
I'd like to turn tweets into vectors for machine learning, so that I can categorize them based on content using Spark's K-Means clustering. For example, all tweets relating to Amazon would be put into one category.
I have tried splitting the tweet into words and creating a vector using HashingTF, but that wasn't very successful.
Are there any other ways to vectorize tweets?
Recommended answer
You can try the following pipeline:
First, tokenize the input Tweet (located in the column text). Basically, this creates a new column rawWords containing the list of words extracted from the original text. To get these words, it matches alphanumeric tokens rather than splitting on delimiters (.setPattern("\\w+").setGaps(false)).
import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("rawWords")
  .setPattern("\\w+")
  .setGaps(false)
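Outside of Spark, the effect of this tokenizer can be illustrated with a small plain-Python sketch (the regex comes from the snippet above; the sample tweet is made up):

```python
import re

def tokenize(text):
    # Mimic RegexTokenizer with pattern "\w+" and gaps=false: the
    # pattern matches the tokens themselves (runs of word characters)
    # rather than the delimiters between them. Tokens are lower-cased,
    # which RegexTokenizer also does by default.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(tokenize("Amazon's new Kindle is 20% off! #deal"))
# ['amazon', 's', 'new', 'kindle', 'is', '20', 'off', 'deal']
```

Note that punctuation such as the apostrophe and the hash sign simply disappears, which is usually what you want for clustering.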
Secondly, you may consider removing stop words, i.e. dropping the less significant words in the text, such as a, the, of, etc.
import org.apache.spark.ml.feature.StopWordsRemover

val stopWordsRemover = new StopWordsRemover()
  .setInputCol("rawWords")
  .setOutputCol("words")
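The removal itself is just filtering; as a plain-Python sketch (the tiny stop-word set here is purely illustrative, Spark's default English list is much longer):

```python
# Illustrative stop-word set; StopWordsRemover ships a much
# larger default English list.
STOP_WORDS = {"a", "the", "of", "is", "s"}

def remove_stop_words(raw_words):
    # Keep only the words that are not in the stop-word set.
    return [w for w in raw_words if w not in STOP_WORDS]

print(remove_stop_words(["amazon", "s", "new", "kindle", "is", "20", "off"]))
# ['amazon', 'new', 'kindle', '20', 'off']
```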
Now it's time to vectorize the words column. In this example I'm using CountVectorizer, which is quite basic. There are many others, such as a TF-IDF vectorizer. You can find more information here.
I've configured the CountVectorizer so that it creates a vocabulary of 10,000 words, with each word appearing a minimum of 5 times across all documents and a minimum of 1 time in each document.
import org.apache.spark.ml.feature.CountVectorizer

val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5.0)
  .setMinTF(1.0)
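To make the three settings concrete, here is a plain-Python sketch of what CountVectorizer computes (the toy corpus, the smaller thresholds, and the alphabetical tie-breaking are assumptions for illustration, not Spark's exact ordering):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    # Keep words that occur in at least min_df documents (minDF),
    # capped at vocab_size entries (vocabSize). Ties between equally
    # frequent words are broken alphabetically here just to keep the
    # example deterministic.
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency, not term frequency
    kept = sorted((w for w, c in df.items() if c >= min_df),
                  key=lambda w: (-df[w], w))[:vocab_size]
    return {w: i for i, w in enumerate(kept)}

def count_vector(doc, vocab, min_tf=1):
    # Count vocabulary words in one document; counts below min_tf
    # are zeroed out (minTF=1.0 keeps every occurrence).
    tf = Counter(w for w in doc if w in vocab)
    vec = [0] * len(vocab)
    for w, c in tf.items():
        if c >= min_tf:
            vec[vocab[w]] = c
    return vec

docs = [["amazon", "kindle"], ["amazon", "prime"], ["kindle", "sale"]]
vocab = fit_vocabulary(docs, vocab_size=10, min_df=2)
print(sorted(vocab))                                      # ['amazon', 'kindle']
print(count_vector(["amazon", "amazon", "echo"], vocab))  # [2, 0]
```

Words below the minDF threshold (prime, sale) never make it into the vocabulary, and out-of-vocabulary words at transform time (echo) are simply ignored.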
Finally, just create the pipeline, then fit it and transform your data with the model it generates by passing in the datasets.
import org.apache.spark.ml.Pipeline

val transformPipeline = new Pipeline()
  .setStages(Array(
    tokenizer,
    stopWordsRemover,
    countVectorizer))

transformPipeline.fit(training).transform(test)
Hope it helps.