Word count frequency in document
Problem description
I have a directory containing 1000 .txt files. I want to know, for every word, how many of the 1000 documents it occurs in. So even if the word "cow" occurred 100 times in document X, it is still counted as one. If it occurs in a different document, the count is incremented by one. So the maximum is 1000, if "cow" appears in every single document. How do I do this the easy way, without using any external library? Here's what I have so far:
private Hashtable<String, Integer> getAllWordCount()
{
Hashtable<String, Integer> result = new Hashtable<String, Integer>();
HashSet<String> words = new HashSet<String>();
try {
for (int j = 0; j < fileDirectory.length; j++){
File theDirectory = new File(fileDirectory[j]);
File[] children = theDirectory.listFiles();
for (int i = 0; i < children.length; i++){
Scanner scanner = new Scanner(new FileReader(children[i]));
while (scanner.hasNext()){
String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
if (words.contains(text) == false){
if (result.get(text) == null)
result.put(text, 1);
else
result.put(text, result.get(text) + 1);
words.add(text);
}
}
}
words.clear();
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println(result.size());
return result;
}
Recommended answer
You also need a HashSet<String> in which you store each unique word you've read from the current file.
Then, after every word read, check whether it is already in the set. If it isn't, increment the corresponding value in the result map (or add a new entry if there was none, like you already do) and add the word to the set.
Don't forget to reset the set when you start to read a new file, though.
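The steps above could be sketched roughly as follows. This is only an illustration using the standard library: the class name DocumentFrequency and the method countDocumentFrequency are hypothetical, and it assumes a single directory of plain-text files and keeps the question's case-sensitive, alphanumeric-only notion of a word. Note that the set is created inside the per-file loop, so it starts empty for each file rather than for each directory.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

class DocumentFrequency {

    // Hypothetical helper: for each word, counts the number of files in
    // `directory` that contain it at least once.
    static Map<String, Integer> countDocumentFrequency(File directory) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        File[] children = directory.listFiles();
        if (children == null) {
            return result; // not a directory, or it could not be listed
        }
        for (File child : children) {
            if (!child.isFile()) {
                continue; // skip subdirectories
            }
            // A fresh set per file: this is the "reset" the answer mentions.
            Set<String> seenInThisFile = new HashSet<>();
            try (Scanner scanner = new Scanner(new FileReader(child))) {
                while (scanner.hasNext()) {
                    // Same normalization as in the question: keep letters and digits only.
                    String word = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (word.isEmpty()) {
                        continue; // token was pure punctuation
                    }
                    // Set.add returns true only the first time the word is seen in this file.
                    if (seenInThisFile.add(word)) {
                        result.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return result;
    }
}
```

Calling DocumentFrequency.countDocumentFrequency(new File("some/dir")) would return the per-word document counts; lowercasing each word before the set lookup is one way to make the count case-insensitive if that is wanted.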