Counting frequency of words from a .txt file in Java

Problem description

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.

I have a set of text files in both English and French, in their respective folders, labeled 1-20. The method takes a directory (in this case "docs/train/eng/" or "docs/train/fre/") and the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the files. (Key = word, Value = frequency.)

This is the code I came up with for the method:

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
  // Declare the HashMap
  HashMap<String, Integer> wordCount = new HashMap();

  // this large 'for' loop will go through each file in the specified directory.
  for (int k = 1; k < nFiles; k++) {
    // Puts together the string that the FileReader will refer to.
    String learn = directory + k + ".txt";

    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines

      String line = br.readLine();

      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;

      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value
        if (wordCount.containsKey(words[i])) {
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
    }
    // Catching the file not found error
    // and any other errors
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
    }
  }
  return wordCount;
}

The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.

If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, running test after test, and I'm ready to give up.

Recommended answer

Let me combine all the good answers here.

1) Split up your methods so that each handles one thing: one to read the files into a String[], one to process the String[], and one to call the first two.

2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should likely split on \b (the regex word boundary) for this problem.

3) As @jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.

4) To print out the map in the way you likely expect, take a look at the code below:

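// Needs java.util.HashMap and java.util.Map.
// The type parameters are required: with a raw Map, entrySet() yields Object and the loop does not compile.
Map<String, Integer> test = new HashMap<>();

for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}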
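
Points 1) to 3) could be put together roughly as follows. This is only a sketch under some assumptions, not the asker's final solution: the class name WordFrequency and the helper names readLines and countLine are made up for illustration, the files are assumed to be UTF-8, and the Unicode-aware regex flag is my own addition so that accented French letters count as word characters. It also reads every line of every file and loops with k <= nFiles so that the last file is included.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class WordFrequency {

    // Unicode-aware patterns (an assumption, so that words like "été" are kept whole).
    private static final Pattern BOUNDARY =
            Pattern.compile("\\b", Pattern.UNICODE_CHARACTER_CLASS);
    private static final Pattern WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Point 1: one method that only reads a file (assumed to be UTF-8).
    static List<String> readLines(String path) throws IOException {
        return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    }

    // Points 2 and 3: split one line at word boundaries and update the counts.
    static void countLine(String line, Map<String, Integer> wordCount) {
        for (String token : BOUNDARY.split(line)) {
            // Splitting on \b also returns the spaces and punctuation between
            // words, so skip every token that is not made of word characters.
            if (!WORD.matcher(token).matches()) {
                continue;
            }
            Integer current = wordCount.get(token);
            if (current == null) {
                wordCount.put(token, 1);            // first time we see this word
            } else {
                wordCount.put(token, current + 1);  // seen before: increment
            }
        }
    }

    // Point 1 again: one method that calls the other two for 1.txt .. nFiles.txt.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            String path = directory + k + ".txt";
            try {
                for (String line : readLines(path)) {
                    countLine(line, wordCount);
                }
            } catch (IOException e) {
                System.err.println("Could not read " + path + ": " + e);
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Note that a HashMap has no defined iteration order, so the printed words will not come out sorted. Splitting on \b also breaks at apostrophes, so a French form such as l'été is counted as the two words l and été.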
