Any tutorial or code for TF-IDF in Java

Question

I am looking for a simple Java class that computes tf-idf. I want to run a similarity test on 2 documents. I found many BIG APIs that use a tf-idf class, but I do not want to pull in a big jar file just to do my simple test. Please help! Or at least, can someone tell me how to find TF and IDF? I will calculate the results myself :) Or point me to a good Java tutorial for this. Please do not tell me to look on Google; I already did for 3 days and couldn't find anything :( Please also do not refer me to Lucene :(

Recommended answer

Term Frequency is the square root of the number of times a term occurs in a particular document.

Inverse Document Frequency is the log of (the total number of documents divided by the number of documents containing the term), plus one. If the term occurs in zero documents, the denominator is zero, so handle that case separately -- obviously don't try to divide by zero.

If it isn't clear from that answer, there is a TF per term per document, and an IDF per term.

And then TF-IDF(term, document) = TF(term, document) * IDF(term)
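
The three definitions above can be sketched in a few lines of Java. This is only a sketch under stated assumptions: the class and method names are made up for illustration, the IDF is read as log(N / df) + 1, and a term found in zero documents is given an IDF of 0 instead of dividing by zero (that last choice is mine, not something the definitions dictate).

```java
final class TfIdfSketch {
    // TF: square root of the number of times the term occurs in one document.
    static double tf(int countInDocument) {
        return Math.sqrt(countInDocument);
    }

    // IDF: log(totalDocs / docsContainingTerm) + 1.
    // Assumption: a term that appears in no document gets 0 (avoids dividing by zero).
    static double idf(int totalDocs, int docsContainingTerm) {
        if (docsContainingTerm == 0) {
            return 0.0;
        }
        return Math.log((double) totalDocs / docsContainingTerm) + 1.0;
    }

    // TF-IDF(term, document) = TF(term, document) * IDF(term)
    static double tfIdf(int countInDocument, int totalDocs, int docsContainingTerm) {
        return tf(countInDocument) * idf(totalDocs, docsContainingTerm);
    }
}
```

For example, TfIdfSketch.tfIdf(4, 100, 10) multiplies sqrt(4) by (ln(10) + 1). Math.log is the natural logarithm; a different base only rescales every score by the same factor.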

Finally, you use the vector space model to compare documents, where each term is a new dimension and the "length" of the part of the vector pointing in that dimension is the TF-IDF calculation. Each document is a vector, so compute the two vectors and then compute the distance between them.

So to do this in Java, read the file in one line at a time with a FileReader or something, and split on spaces or whatever other delimiters you want to use - each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
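
The second count, the number of files each term appears in, can be derived from the per-file counts. A minimal sketch, assuming each file has already been reduced to a Map from term to count (the class and method names here are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DocumentFrequencySketch {
    // For each term, count how many documents contain it at least once.
    static Map<String, Integer> documentFrequency(List<Map<String, Integer>> perDocumentCounts) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> counts : perDocumentCounts) {
            // keySet() holds each term once per document, so each document adds at most 1.
            for (String term : counts.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }
}
```

The size of the input list is the total number of documents, which is the other number the IDF needs.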

And since I have nothing else to do, I looked up the vector distance formula. Here you go:

D=sqrt((x2-x1)^2+(y2-y1)^2+...+(n2-n1)^2)

For this purpose, x1 is the TF-IDF for term x in document 1.
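
A rough sketch of that distance formula, assuming each document's vector is stored as a Map from term to TF-IDF value and that a term missing from a document counts as 0 in that dimension (the class name and the map representation are my own choices for illustration):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VectorDistanceSketch {
    // Euclidean distance between two sparse TF-IDF vectors.
    static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        // Every term seen in either document is one dimension.
        Set<String> terms = new HashSet<>(doc1.keySet());
        terms.addAll(doc2.keySet());

        double sumOfSquares = 0.0;
        for (String term : terms) {
            // A term absent from a document contributes 0 for that document.
            double diff = doc2.getOrDefault(term, 0.0) - doc1.getOrDefault(term, 0.0);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```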

In response to your question about how to count the words in a document:

  1. Read the file in line by line with a reader, like new BufferedReader(new FileReader(filename)) - you can call BufferedReader.readLine() in a while loop, checking for null each time.
  2. For each line, call line.split("\\s+") - that will split your line on runs of whitespace and give you an array of all of the words.
  3. For each word, add 1 to the word's count for the current document. This could be done using a HashMap.
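
Put together, the three steps might look like the sketch below. The class and method names are hypothetical, and the method takes any Reader so that it can be tried on a string; for a real file you would pass new FileReader(filename).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

final class WordCountSketch {
    // Count how many times each whitespace-separated word occurs in one document.
    static Map<String, Integer> countWords(Reader source) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null once the end of the input is reached.
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    // A blank line splits into one empty string; skip it.
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }
}
```

This treats "Word" and "word" as different terms and leaves punctuation attached; lowercasing or stripping punctuation before counting is a choice left to you.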

Now, D is computed for a pair of documents, not for a single document. With X documents there are X(X-1)/2 distinct pairs, so comparing all documents against each other takes on the order of X^2 comparisons - this shouldn't take particularly long for 10,000 documents. Remember that two documents are MORE similar if the D between them is lower. So you could compute D for every pair of documents and store the results in a priority queue or some other sorted structure such that the most similar pairs bubble up to the top. Make sense?
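
Here is one way the all-pairs comparison could be sketched, assuming D is computed once per pair of documents and that each document is a Map from term to TF-IDF value. The Pair class, the method names and the map representation are all hypothetical; the PriorityQueue is ordered so the pair with the smallest distance comes out first.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

final class PairwiseSimilaritySketch {
    // One compared pair: indexes of the two documents and the distance between them.
    static final class Pair {
        final int first;
        final int second;
        final double distance;

        Pair(int first, int second, double distance) {
            this.first = first;
            this.second = second;
            this.distance = distance;
        }
    }

    // Euclidean distance; a term missing from a document counts as 0.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0.0;
        for (String term : terms) {
            double diff = a.getOrDefault(term, 0.0) - b.getOrDefault(term, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Compare every unordered pair once; poll() returns the most similar pair first.
    static PriorityQueue<Pair> rankPairs(List<Map<String, Double>> vectors) {
        Comparator<Pair> byDistance = Comparator.comparingDouble(p -> p.distance);
        PriorityQueue<Pair> queue = new PriorityQueue<>(byDistance);
        for (int i = 0; i < vectors.size(); i++) {
            for (int j = i + 1; j < vectors.size(); j++) {
                queue.add(new Pair(i, j, distance(vectors.get(i), vectors.get(j))));
            }
        }
        return queue;
    }
}
```

Calling poll() repeatedly on the returned queue yields pairs from most similar to least similar. Keeping every pair in memory is fine for a small test; for many documents you may prefer to keep only the best few.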
