Find most frequent words on a webpage (using Jsoup)?

Problem description

In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts the frequency of words, or any other way to find which words are the most frequent on a webpage, using Jsoup?

Thanks.

Solution

Yes, you could use Jsoup to get the text from the webpage, like this:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
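
Note that body().text() returns the text of the whole page, including menus, the sidebar and the footer. If you only want to count words from the article itself, you could narrow the selection with a CSS selector first. A minimal sketch, assuming the article sits in MediaWiki's usual #mw-content-text container (check the markup of the page you actually fetch before relying on that id):

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
// select only the main content area instead of the whole <body>
String articleText = doc.select("#mw-content-text").text();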

Then, you need to count the words and find out which ones are the most frequent ones. This code looks promising. We need to modify it to use our String output from Jsoup, something like this:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

   public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        Map<String, Word> countMap = new HashMap<String, Word>();

        //connect to wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        //Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
        //Create BufferedReader so the words can be counted
        BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }

                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }

                wordObj.count++;
            }
        }

        reader.close();

        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
        int i = 0;
        int maxWordsToDisplay = 10;

        String[] wordsToIgnore = {"the", "and", "a"};

        for (Word word : sortedWords) {
            if (i >= maxWordsToDisplay) { //10 is the number of words you want to show frequency for
                break;
            }

            if (Arrays.asList(wordsToIgnore).contains(word.word)) {
                i++;
                maxWordsToDisplay++;
            } else {
                System.out.println(word.count + "\t" + word.word);
                i++;
            }

        }

        time = System.currentTimeMillis() - time;

        System.out.println("Finished in " + time + " ms");
    }

    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) { return word.equals(((Word)obj).word); }

        @Override
        public int compareTo(Word b) {
            // sort by descending count; break ties by word so that distinct
            // words with the same count are not dropped by the TreeSet
            int byCount = Integer.compare(b.count, count);
            return byCount != 0 ? byCount : word.compareTo(b.word);
        }
    }
}

Output:

Downloading page...
Analyzing text...
42  of
24  in
20  Wikipedia
19  to
16  is
11  that
10  The
9   was
8   articles
7   featured
Finished in 3300 ms

Some notes:

  • This code can ignore some words, like "the", "and", "a" etc. You will have to customize it.

  • It seems to have problems with unicode characters sometimes. I did not run into this myself, but someone in the comments did.

  • This could be done better and with less code; see the stream-based sketch after these notes.

  • Not well tested.
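
As a sketch of the "less code" point above, the counting could also be done with Java 8 streams instead of the Word class and the TreeSet. This is not the original answer's code, just one possible variant; it splits on the Unicode-aware \P{L} character class (any run of non-letters), which also avoids the hard-coded Latin letter ranges behind the unicode issue mentioned above:

import java.util.*;
import java.util.stream.*;
import org.jsoup.Jsoup;

public class JsoupWordCountStreams {

    public static void main(String[] args) throws Exception {
        String text = Jsoup.connect("http://en.wikipedia.org/").get().body().text();
        Set<String> wordsToIgnore = new HashSet<>(Arrays.asList("the", "and", "a"));

        // count every word, splitting on any run of non-letter characters
        Map<String, Long> counts = Arrays.stream(text.split("\\P{L}+"))
                .filter(w -> !w.isEmpty() && !wordsToIgnore.contains(w.toLowerCase()))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        // print the ten most frequent words, highest count first
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}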

Enjoy!
