Any tutorial or code for TF-IDF in Java


Problem description

I am looking for a simple Java class that can compute TF-IDF. I want to run a similarity test on two documents. I have found many big APIs that include a TF-IDF class, but I do not want to pull in a large jar file just for a simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results myself :) Or point me to a good Java tutorial on this. Please do not tell me to search Google; I already did that for 3 days and couldn't find anything :( Please also do not refer me to Lucene :(

Recommended answer

Term frequency (TF) is the square root of the number of times a term occurs in a particular document. (The square root is a damping choice; the raw count is also commonly used as TF.)

Inverse document frequency (IDF) is the log of (the total number of documents divided by the number of documents containing the term), plus one. If the term occurs in zero documents, handle that case separately; obviously don't try to divide by zero.

To be explicit: there is one TF per term per document, and one IDF per term.

And then TF-IDF(term, document) = TF(term, document) * IDF(term)
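The definitions above can be sketched in a few lines of Java. This is only an illustration: the class name `TfIdfSketch`, its method names, and the choice to return 0 for a term that appears in no document are all assumptions, not part of the original answer. It expects the per-document term counts and the per-term document counts to have been collected already.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper class; names are illustrative only. */
class TfIdfSketch {

    /** TF as defined above: square root of the raw count in one document. */
    static double tf(int countInDocument) {
        return Math.sqrt(countInDocument);
    }

    /** IDF as defined above: log(totalDocuments / documentsContainingTerm) + 1. */
    static double idf(int totalDocuments, int documentsContainingTerm) {
        if (documentsContainingTerm == 0) {
            return 0.0; // assumption: a term seen in no document contributes nothing
        }
        return Math.log((double) totalDocuments / documentsContainingTerm) + 1.0;
    }

    /** Builds the TF-IDF vector of one document as a term -> weight map. */
    static Map<String, Double> tfIdfVector(Map<String, Integer> termCounts,
                                           Map<String, Integer> documentFrequency,
                                           int totalDocuments) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termCounts.entrySet()) {
            int df = documentFrequency.getOrDefault(entry.getKey(), 0);
            vector.put(entry.getKey(), tf(entry.getValue()) * idf(totalDocuments, df));
        }
        return vector;
    }
}
```

Storing each vector as a map keeps only the terms that actually occur in the document, which is usually far fewer than the whole vocabulary.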

Finally, you use the vector space model to compare documents, where each term is a dimension and the "length" of the part of the vector pointing in that dimension is the term's TF-IDF value. Each document is a vector, so compute the two vectors and then compute the distance between them.

So to do this in Java, read each file one line at a time with a FileReader or something similar, and split on spaces or whatever other delimiters you want to use; each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
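The second count, the number of files each term appears in, can be derived from the per-file counts. A minimal sketch, assuming each file's counts are already held in a `Map<String, Integer>`; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper class; names are illustrative only. */
class DocumentFrequencySketch {

    /** For every term, counts how many of the given documents contain it at least once. */
    static Map<String, Integer> documentFrequency(List<Map<String, Integer>> perDocumentCounts) {
        Map<String, Integer> frequency = new HashMap<>();
        for (Map<String, Integer> counts : perDocumentCounts) {
            // Each document contributes at most 1 per term, however often the term occurs in it.
            for (String term : counts.keySet()) {
                frequency.merge(term, 1, Integer::sum);
            }
        }
        return frequency;
    }
}
```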

And since I have nothing else to do, I looked up the vector distance formula. Here you go:

D=sqrt((x2-x1)^2+(y2-y1)^2+...+(n2-n1)^2)

For this purpose, x1 is the TF-IDF for term x in document 1, x2 is the TF-IDF for the same term in document 2, and so on for every term.
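As a rough sketch, the distance formula can be applied to two TF-IDF vectors stored as term-to-weight maps. The class name `VectorDistanceSketch` is hypothetical, and treating a term that is missing from one document as weight 0 is an assumption of this sketch:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper class; names are illustrative only. */
class VectorDistanceSketch {

    /** Euclidean distance between two sparse vectors; a missing term counts as 0. */
    static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        // The dimensions are the union of the terms seen in either document.
        Set<String> terms = new HashSet<>(doc1.keySet());
        terms.addAll(doc2.keySet());

        double sumOfSquares = 0.0;
        for (String term : terms) {
            double diff = doc2.getOrDefault(term, 0.0) - doc1.getOrDefault(term, 0.0);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```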

Edit: In response to your question about how to count the words in a document:


  1. Read the file in line by line with a reader, like new BufferedReader(new FileReader(filename)). You can call BufferedReader.readLine() in a while loop, checking for null each time.
  2. For each line, call line.split("\\s+"). That will split the line on runs of whitespace and give you an array of all of the words.
  3. For each word, add 1 to the word's count for the current document. This could be done using a HashMap.
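Put together, the three steps might look like the sketch below. The class name `WordCountSketch` and the method `countTerms` are hypothetical. Accepting any `Reader` (rather than a filename) and wrapping `IOException` in an unchecked exception are choices of this sketch, not something the steps above require.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper class; names are illustrative only. */
class WordCountSketch {

    /** Counts how often each whitespace-separated word occurs in the given source. */
    static Map<String, Integer> countTerms(Reader source) {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // Step 1: read line by line until readLine() returns null.
            while ((line = reader.readLine()) != null) {
                // Step 2: split on runs of whitespace.
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // a blank line yields a single empty token
                    }
                    // Step 3: add 1 to this word's count.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

For a file on disk you would pass new FileReader(filename), whose constructor can throw FileNotFoundException. You may also want to lowercase each word or strip punctuation before counting, depending on what you consider a term.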

Now, after computing the TF-IDF vector for each document, you will have X vectors, where X is the number of documents. Comparing all documents against each other takes on the order of X^2 distance computations; this shouldn't take particularly long for 10,000 documents. Remember that two documents are MORE similar if the distance D between them is lower. So you could compute D for every pair of documents and store the results in a priority queue or some other sorted structure such that the most similar documents bubble up to the top. Make sense?
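A minimal sketch of that ranking step, assuming each document has already been turned into a TF-IDF vector (a term-to-weight map) and that D is the distance between a pair of vectors as in the formula above. The class `PairRankingSketch` and its nested `Pair` type are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical helper class; names are illustrative only. */
class PairRankingSketch {

    /** One compared pair: indices into the document list plus their distance. */
    static final class Pair {
        final int first;
        final int second;
        final double distance;

        Pair(int first, int second, double distance) {
            this.first = first;
            this.second = second;
            this.distance = distance;
        }
    }

    /** Euclidean distance between two sparse vectors; a missing term counts as 0. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0.0;
        for (String term : terms) {
            double diff = b.getOrDefault(term, 0.0) - a.getOrDefault(term, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns every distinct pair of documents, most similar (smallest distance) first. */
    static List<Pair> rankPairs(List<Map<String, Double>> vectors) {
        Comparator<Pair> byDistance = (p, q) -> Double.compare(p.distance, q.distance);
        PriorityQueue<Pair> queue = new PriorityQueue<>(byDistance);
        for (int i = 0; i < vectors.size(); i++) {
            for (int j = i + 1; j < vectors.size(); j++) {
                queue.add(new Pair(i, j, distance(vectors.get(i), vectors.get(j))));
            }
        }
        List<Pair> ranked = new ArrayList<>();
        while (!queue.isEmpty()) {
            ranked.add(queue.poll()); // poll() always removes the smallest remaining distance
        }
        return ranked;
    }
}
```

Only pairs with i < j are compared, so X documents need X(X-1)/2 distance computations rather than X^2.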
