Counting the number of word occurrences in a file
Suppose we have a txt file and want to know how many times each word appears in it. I used the following code, but it does not work: every count comes out as 1.
First I read the txt file and write each word on a separate line; at the same time, I add each word to an ArrayList. Then I read the first line of the file, fetch the first element of the ArrayList, and compare it against the whole file. For every occurrence, I increment the entry in an array that holds the number of occurrences. Then I fetch the second ArrayList item, and so on, until the end of the ArrayList is reached.
private static void count(String text) throws FileNotFoundException, IOException {
    FileOutputStream thewords = new FileOutputStream(Check);
    ArrayList<String> keyArrayList = new ArrayList<String>();
    int countWord = 0;
    StringTokenizer tokenizer = new StringTokenizer(text);
    while (tokenizer.hasMoreTokens())
    {
        String nextWord = tokenizer.nextToken();
        keyArrayList.add(nextWord);
        thewords.write(nextWord.getBytes());
        thewords.write(System.getProperty("line.separator").getBytes());
        countWord++;
    }
    int[] numbOfOccurance = new int[countWord];
    BufferedReader br = new BufferedReader(new FileReader(Check));
    String readline;
    for (int loopIndex = 0; loopIndex < countWord; loopIndex++)
    {
        readline = br.readLine();
        String test = keyArrayList.get(loopIndex);
        if (test.equals(readline))
        {
            numbOfOccurance[loopIndex]++;
        }
    }
}
Your method is incredibly slow: you have to search through the entire ArrayList to find out whether a word appears more than once.
Further, StringTokenizer is a legacy class whose use is discouraged in new code.
May I suggest the following approach:
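import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Stream;

import static java.util.function.Function.identity;
import static java.util.stream.Collectors.toMap;

public static void main(String[] args) throws Exception {
    final Path path = Paths.get("path", "to", "file");
    final Map<String, Integer> counts = countOccurrences(path);
}

private static Map<String, Integer> countOccurrences(Path path) throws IOException {
    // Delimiter: one or more characters that are neither ASCII letters nor apostrophes.
    final Pattern pattern = Pattern.compile("[^A-Za-z']+");
    // try-with-resources closes the underlying file when the stream is done.
    try (final Stream<String> lines = Files.lines(path)) {
        return lines
                .flatMap(pattern::splitAsStream)
                // Skip the empty strings a split can produce, e.g. for lines that start with a delimiter.
                .filter(w -> !w.isEmpty())
                .collect(toMap(identity(), w -> 1, Integer::sum));
    }
}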
This uses the Java 8 Stream API to read lines from the file. It then splits each line on [^A-Za-z']+, i.e. runs of characters that are neither letters nor apostrophes, using flatMap to create a Stream of individual words.
We then collect the words into a Map: each word is put into the Map with the value 1, and the merging function Integer::sum adds the values together whenever a word is already in the Map.
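To see the split-and-merge steps in isolation, here is a minimal sketch that runs the same pipeline over a few in-memory lines instead of a file. The class name WordCountSketch, the helper countWords, and the sample sentences are made up for illustration; the filter for empty strings is an extra precaution that is not part of the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Same delimiter as above: runs of characters that are neither ASCII letters nor apostrophes.
    private static final Pattern NON_WORD = Pattern.compile("[^A-Za-z']+");

    // Hypothetical helper: counts words in in-memory lines rather than in a file.
    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // One line becomes a stream of the pieces between delimiters.
                .flatMap(NON_WORD::splitAsStream)
                // Assumption: drop empty pieces, e.g. from a line starting with punctuation.
                .filter(w -> !w.isEmpty())
                // First sighting stores 1; later sightings are merged with Integer::sum.
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the hat", "the end");
        Map<String, Integer> counts = countWords(lines);
        System.out.println(counts.get("the")); // 3
        System.out.println(counts.get("cat")); // 1
    }
}
```

Note that the counting is case-sensitive, so "The" and "the" end up as separate keys unless the words are lower-cased first.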
You can then list the contents of the Map, sorted by number of occurrences in ascending order, using the following:
counts.entrySet().stream()
        .sorted(Map.Entry.comparingByValue())
        .map(e -> String.format("%s -> %s", e.getKey(), e.getValue()))
        .forEach(System.out::println);
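The listing above prints the least frequent words first. If the most frequent words should come first, one possible variation is to pass a reversed comparator, as in the sketch below. The class name SortedCountsSketch, the helper formatByCountDescending, the alphabetical tie-break, and the sample counts are assumptions added for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SortedCountsSketch {

    // Hypothetical helper: formats entries with the highest count first,
    // breaking ties alphabetically so the order is deterministic.
    static List<String> formatByCountDescending(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .map(e -> String.format("%s -> %s", e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample counts, made up for the example.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("and", 1);
        // Prints "the -> 3", then "and -> 1", then "cat -> 1".
        formatByCountDescending(counts).forEach(System.out::println);
    }
}
```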