How can I create a TF-IDF for Text Classification using Spark?


Question

I have a CSV file with the following format:

product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]

product_idX is an integer and product_titleX is a String, for example:

453478692, Apple iPhone 4 8Go

I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.

So far I am using Spark with Scala, following the tutorials I found on the official page and the Berkeley AmpCamp 3 and 4 materials.

So I'm reading the file:

val file = sc.textFile("offers.csv")

Then I'm mapping it into an RDD[Array[String]]:

val tuples = file.map(line => line.split(",")).cache

and after that I'm transforming those into a pair RDD[(Int, String)]:

val pairs = tuples.map(line => (line(0).toInt, line(1)))

But I'm stuck here and I don't know how to create the Vector from it to turn it into a TF-IDF.

Thanks

Recommended answer

To do this myself (using pyspark), I started by creating two data structures out of the corpus. The first is a key/value structure of

document_id, [token_ids]

and the second is like

token_id, [document_ids]

I'll call those corpus and inv_index respectively.
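
As a minimal pyspark sketch (assumptions: the tokenize helper is a placeholder, pairs is the (product_id, product_title) RDD from the question loaded in pyspark, and for simplicity the sketch keeps raw tokens rather than token_ids; the mapping to ids is covered at the end of this answer), those two structures could be built like this:

def tokenize(title):
    # Stand-in tokenizer: a real one would normalize the text and map
    # tokens to ids (see the note on tokenization at the end of this answer).
    return title.lower().split()

# corpus: document_id -> list of tokens
corpus = pairs.map(lambda kv: (kv[0], tokenize(kv[1])))

# inv_index: token -> list of document_ids that contain the token
inv_index = (corpus
             .flatMap(lambda kv: [(tok, kv[0]) for tok in set(kv[1])])
             .groupByKey()
             .mapValues(list))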

To get tf we need to count the number of occurrences of each token in each document. So

from collections import Counter

def wc_per_row(row):
    # Count occurrences of each token in one document.
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return list(cnt.items())

tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))

The df is simply the length of each term's inverted index. From that we can calculate the idf.

df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()

# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda kv: (kv[0], 1. + log10(num_documents / kv[1]))).collect()
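
The filtering mentioned in the comment above could look something like the line below (the thresholds are illustrative assumptions, not from the original answer); idf would then be computed from df_filtered instead of df:

# Keep terms that occur in at least 2 documents but in no more than half of them.
df_filtered = df.filter(lambda kv: 2 <= kv[1] <= 0.5 * num_documents)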

Now we just have to do a join on the term_id:

def calc_tfidf(tf_tuples, idf_tuples):
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
        (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))

This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.

And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
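
One way to build that mapping (a sketch assuming the whole vocabulary fits in the driver; not part of the original answer) is to number the distinct tokens with zipWithIndex:

# vocab: token -> token_id
vocab = (corpus
         .flatMap(lambda kv: kv[1])
         .distinct()
         .zipWithIndex()
         .collectAsMap())

# Documents re-expressed as lists of token_ids.
corpus_ids = corpus.mapValues(lambda toks: [vocab[t] for t in toks])

In practice the vocab dict would probably be wrapped in sc.broadcast rather than captured directly in the closure.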

If anyone can improve on this, I'm very interested.
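
For what it's worth, here is one hedged sketch of how the collect might be avoided (an assumption, not part of the original answer): keep the idf values in an RDD and join on the token instead.

# idf kept as an RDD instead of a collected list
idf_rdd = df.map(lambda kv: (kv[0], 1. + log10(num_documents / kv[1])))

# Flatten tf to (token, (doc_id, count)), join with idf, and regroup by document.
tfidf_joined = (tf
                .flatMap(lambda kv: [(tok, (kv[0], cnt)) for tok, cnt in kv[1]])
                .join(idf_rdd)                      # (token, ((doc_id, count), idf))
                .map(lambda kv: (kv[1][0][0], (kv[0], kv[1][0][1] * kv[1][1])))
                .groupByKey()
                .mapValues(list))                   # doc_id -> [(token, tf*idf)]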
