Calculate TF-IDF of documents using HBase as the datasource


Problem description

I want to calculate the TF (Term Frequency) and the IDF (Inverse Document Frequency) of documents that are stored in HBase.

I also want to save the calculated TF in a HBase table, also save the calculated IDF in another HBase table.

Can you guide me through?

I have looked at BayesTfIdfDriver from Mahout 0.4 but I am not getting a head start.

Solution

The outline of a solution is pretty straightforward:

  1. Do a word count over your HBase tables, storing both the term frequency and the document frequency for each word.
  2. In your reduce phase, aggregate the term frequency and document frequency for each word (see the MapReduce sketch after this list).
  3. Given a count of your documents, scan through your aggregated results one more time and calculate the IDF based off of the document frequency.
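
A minimal sketch of steps 1 and 2, assuming the documents live in an HBase table named "documents" with the raw text in a cf:content column, and that the aggregated statistics are written to a "term_stats" table. All class, table, and column names here are illustrative assumptions, not from the original answer, and the snippet targets the HBase 1.x+ client API:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TermStatsJob {

  /** Emits (term, docId) for every token of every document row. */
  static class TokenizeMapper extends TableMapper<Text, Text> {
    private final Text term = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Assume the row key identifies the document.
      docId.set(row.get(), row.getOffset(), row.getLength());
      byte[] content = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("content"));
      if (content == null) {
        return;
      }
      for (String token : Bytes.toString(content).toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          term.set(token);
          context.write(term, docId);
        }
      }
    }
  }

  /** Aggregates per-document counts (TF) and distinct documents (DF) for each term. */
  static class StatsReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Map<String, Long> tfPerDoc = new HashMap<String, Long>();
      for (Text docId : docIds) {
        String id = docId.toString();
        Long count = tfPerDoc.get(id);
        tfPerDoc.put(id, count == null ? 1L : count + 1L);
      }
      // One row per term: a "stats:df" column plus one "tf:<docId>" column per document.
      Put put = new Put(Bytes.toBytes(term.toString()));
      put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("df"),
          Bytes.toBytes(Long.toString(tfPerDoc.size())));
      for (Map.Entry<String, Long> e : tfPerDoc.entrySet()) {
        put.addColumn(Bytes.toBytes("tf"), Bytes.toBytes(e.getKey()),
            Bytes.toBytes(Long.toString(e.getValue())));
      }
      context.write(null, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "tf-df over hbase");
    job.setJarByClass(TermStatsJob.class);

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("content"));

    // "documents" is the source table, "term_stats" the output table (both assumed names).
    TableMapReduceUtil.initTableMapperJob("documents", scan,
        TokenizeMapper.class, Text.class, Text.class, job);
    TableMapReduceUtil.initTableReducerJob("term_stats", StatsReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Splitting the TF and DF outputs into the two separate HBase tables you mentioned is just a matter of pointing the reducer's Puts (or a second job) at different target tables.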

The Wikipedia page on TF-IDF is a good reference for the details of the formula: http://en.wikipedia.org/wiki/Tf-idf
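
For step 3, one option is a plain client-side scan rather than another MapReduce job. A hedged sketch, reusing the assumed "term_stats" layout from the snippet above, writing into an assumed "idf" table, and using the standard idf(t) = log(N / df(t)) formulation from the Wikipedia page:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IdfPass {

  public static void computeIdf(Configuration conf, long totalDocs) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table stats = connection.getTable(TableName.valueOf("term_stats"));
         Table idfTable = connection.getTable(TableName.valueOf("idf"))) {

      // Only the document-frequency column is needed for this pass.
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("df"));

      try (ResultScanner scanner = stats.getScanner(scan)) {
        for (Result row : scanner) {
          long df = Long.parseLong(
              Bytes.toString(row.getValue(Bytes.toBytes("stats"), Bytes.toBytes("df"))));
          // idf(t) = log(N / df(t)), with N the total number of documents.
          double idf = Math.log((double) totalDocs / df);

          Put put = new Put(row.getRow());  // row key is the term
          put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("idf"),
              Bytes.toBytes(Double.toString(idf)));
          idfTable.put(put);
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // totalDocs could come from counting the documents table, e.g. with an HBase RowCounter job.
    computeIdf(HBaseConfiguration.create(), Long.parseLong(args[0]));
  }
}
```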
