Read a .txt file and return a list of words with their frequency in the file


Question

I have this so far, but it only prints the contents of the .txt file to the screen:

import java.io.*;

public class ReadFile {
    public static void main(String[] args) throws IOException {
        String Wordlist;
        int Frequency;

        File file = new File("file1.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        String line = null;

        while( (line = br.readLine()) != null) {
            String [] tokens = line.split("\\s+");
            System.out.println(line);
        }
    }
}

Can anyone help me so that it prints a list of words and each word's frequency?

Recommended answer

Do something like this. I'm assuming only commas or periods occur in the file; otherwise you'll have to remove other punctuation characters as well. Note that the removePunctuation helper below strips every character that is not an ASCII letter, so digits and apostrophes are dropped too. I'm using a TreeMap so the words in the map are stored in their natural alphabetical order.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.TreeMap;

  public static TreeMap<String, Integer> generateFrequencyList()
    throws IOException {
    TreeMap<String, Integer> wordsFrequencyMap = new TreeMap<String, Integer>();
    String file = "/tmp/lorem.txt";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = br.readLine()) != null) {
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {
          String word = removePunctuation(token).toLowerCase();
          if (word.isEmpty()) {
            continue; // skip tokens with no letters (e.g. "-" or a blank line)
          }
          if (!wordsFrequencyMap.containsKey(word)) {
            wordsFrequencyMap.put(word, 1);
          } else {
            int count = wordsFrequencyMap.get(word);
            wordsFrequencyMap.put(word, count + 1);
          }
        }
      }
    }
    return wordsFrequencyMap;
  }

  private static String removePunctuation(String token) {
    token = token.replaceAll("[^a-zA-Z]", "");
    return token;
  }
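
As a side note, the containsKey/get/put sequence above can be shortened with Map.merge on Java 8 or later. The following is only a sketch under a few assumptions: the class name WordCountSketch and the method countWords are made up for illustration, and the input is a list of in-memory lines rather than a file so the example can run on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name, used only for this illustration.
public class WordCountSketch {

  // Counts words in the given lines; keys are lowercased and kept in alphabetical order.
  static Map<String, Integer> countWords(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      for (String token : line.split("\\s+")) {
        // Same cleanup as removePunctuation above: keep ASCII letters only.
        String word = token.replaceAll("[^a-zA-Z]", "").toLowerCase(Locale.ROOT);
        if (word.isEmpty()) {
          continue; // the token held no letters
        }
        // merge stores 1 for a new key, otherwise adds 1 to the existing count.
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("The cat, the dog.", "A cat - and THE bird");
    System.out.println(countWords(lines));
  }
}
```

To count words from a file instead, the lines could be fed in from the BufferedReader loop shown earlier.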

A main method for testing is shown below. To get the percentages, first get the total word count by iterating through the map and adding up all the values, then make a second pass to compute each word's percentage. By the way, if this is part of a larger piece of work, you could also take a look at the Apache Commons Math library for calculating frequency distributions. If you use its Frequency class, you can keep adding words to it and then get the descriptive statistics at the end.

  public static void main(String[] args) {
    try {
      // First pass: total number of words in the file
      int totalWords = 0;
      TreeMap<String, Integer> freqMap = generateFrequencyList();
      for (String key : freqMap.keySet()) {
        totalWords += freqMap.get(key);
      }

      // Second pass: print each word with its count and its share of the total
      System.out.println("Word\tCount\tPercentage");
      for (String key : freqMap.keySet()) {
        System.out.println(key + "\t" + freqMap.get(key) + "\t" + ((double) freqMap.get(key) * 100.0 / (double) totalWords));
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
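
If the percentages are needed elsewhere and not just printed, the same two-pass idea could be pulled into a small helper. This is a minimal sketch, not part of the original answer: the class name PercentageSketch and the method toPercentages are hypothetical, and the two-decimal formatting in main is just one possible choice.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name, used only for this illustration.
public class PercentageSketch {

  // Converts word counts into percentages of the total word count.
  static Map<String, Double> toPercentages(Map<String, Integer> counts) {
    int total = 0;
    for (int c : counts.values()) {
      total += c; // first pass: sum all counts
    }
    Map<String, Double> percentages = new TreeMap<>();
    if (total == 0) {
      return percentages; // nothing to divide by for an empty map
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // second pass: each word's share of the total
      percentages.put(e.getKey(), e.getValue() * 100.0 / total);
    }
    return percentages;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new TreeMap<>();
    counts.put("cat", 1);
    counts.put("the", 3);
    for (Map.Entry<String, Double> e : toPercentages(counts).entrySet()) {
      // Print the word and its percentage rounded to two decimals.
      System.out.println(String.format(Locale.ROOT, "%s\t%.2f", e.getKey(), e.getValue()));
    }
  }
}
```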
