Find most frequent words on a webpage (using Jsoup)?

Problem description

In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts the frequency of words, or any other way to find which words are the most frequent on a webpage, using Jsoup?

Thanks.

Solution

Yes, you could use Jsoup to get the text from the webpage, like this:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
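
Note that body().text() returns the text of the whole page, including menus, the sidebar and the footer. If you only want to count words from the article itself, you could narrow the selection with a CSS selector first. A minimal sketch, assuming the article sits in MediaWiki's usual #mw-content-text container (check the markup of the page you actually fetch before relying on that id):

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
// select only the main content area instead of the whole <body>
String articleText = doc.select("#mw-content-text").text();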

Then, you need to count the words and find out which ones are the most frequent ones. This code looks promising. We need to modify it to use our String output from Jsoup, something like this:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

   public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        Map<String, Word> countMap = new HashMap<String, Word>();

        //connect to wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        //Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
        //Create BufferedReader so the words can be counted
        BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }

                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }

                wordObj.count++;
            }
        }

        reader.close();

        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
        int i = 0;
        int maxWordsToDisplay = 10;

        String[] wordsToIgnore = {"the", "and", "a"};

        for (Word word : sortedWords) {
            if (i >= maxWordsToDisplay) { //10 is the number of words you want to show frequency for
                break;
            }

            if (Arrays.asList(wordsToIgnore).contains(word.word)) {
                i++;
                maxWordsToDisplay++;
            } else {
                System.out.println(word.count + "\t" + word.word);
                i++;
            }

        }

        time = System.currentTimeMillis() - time;

        System.out.println("Finished in " + time + " ms");
    }

    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) { return word.equals(((Word)obj).word); }

        @Override
        public int compareTo(Word b) {
            // sort by descending count; break ties by word so that distinct
            // words with the same count are not dropped by the TreeSet
            int byCount = Integer.compare(b.count, count);
            return byCount != 0 ? byCount : word.compareTo(b.word);
        }
    }
}

Output:

Downloading page...
Analyzing text...
42  of
24  in
20  Wikipedia
19  to
16  is
11  that
10  The
9   was
8   articles
7   featured
Finished in 3300 ms

Some notes:

  • This code can ignore some words, like "the", "and", "a" etc. You will have to customize it.

  • It seems to have problems with unicode characters sometimes. I did not run into this myself, but someone in the comments did.

  • This could be done better and with less code; see the stream-based sketch after these notes.

  • Not well tested.
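
As a sketch of the "less code" point above, the counting could also be done with Java 8 streams instead of the Word class and the TreeSet. This is not the original answer's code, just one possible variant; it splits on the Unicode-aware \P{L} character class (any run of non-letters), which also avoids the hard-coded Latin letter ranges behind the unicode issue mentioned above:

import java.util.*;
import java.util.stream.*;
import org.jsoup.Jsoup;

public class JsoupWordCountStreams {

    public static void main(String[] args) throws Exception {
        String text = Jsoup.connect("http://en.wikipedia.org/").get().body().text();
        Set<String> wordsToIgnore = new HashSet<>(Arrays.asList("the", "and", "a"));

        // count every word, splitting on any run of non-letter characters
        Map<String, Long> counts = Arrays.stream(text.split("\\P{L}+"))
                .filter(w -> !w.isEmpty() && !wordsToIgnore.contains(w.toLowerCase()))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        // print the ten most frequent words, highest count first
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}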

Enjoy!
