Implementing Threads Into Java Web Crawler


Question

Here is the original web crawler I wrote (just for reference):

https://github.com/domshahbazi/java-webcrawler/tree/master

This is a simple web crawler which visits a given initial web page, scrapes all the links from the page and adds them to a queue (LinkedList), where they are then popped off one by one and visited, and the cycle starts again. To speed up my program, and for learning, I tried to implement it using threads so I could have many threads operating at once, indexing more pages in less time. Below is each class:

Main class

public class controller {

    public static void main(String args[]) throws InterruptedException {

        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");

        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);

        crawl1.start();
        crawl2.start();
   }    
}

Crawler class (Thread)

public class Crawler extends Thread {

    /** Instance of Data Structure **/
    DataStruc data;

    /** Number of page connections allowed before program terminates **/
    private final int INDEX_LIMIT = 10;

    /** Initial URL to visit **/
    public Crawler(DataStruc d) {
        data = d;
    }

    public void run() {

        // Counter to keep track of number of indexed URLS
        int counter = 0;

        // While URL's left to visit
        while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {

            // Pop next URL to visit from stack
            String currentUrl = data.getURL();

            try {
                // Fetch and parse HTML document
                Document doc = Jsoup.connect(currentUrl)                 
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
                        .referrer("http://www.google.com")
                        .timeout(12000)
                        .followRedirects(true)
                        .get();

                // Increment counter if connection to web page succeeds
                counter++;

                /** .select returns a list of elements (links in this case) **/
                Elements links = doc.select("a[href]"); // Select all anchor elements that have an href attribute

                // Add newly found links to stack
                addLinksToQueue(links);                             

            } catch (IOException e) {
                //e.printStackTrace();
                System.out.println("Error: "+currentUrl);
            }               
        }       
    }

    public void addLinksToQueue(Elements el) {
        // For each element in links
        for(Element e : el) {           

            String theLink = e.attr("abs:href"); // 'abs' prefix ensures an absolute URL is returned rather than a relative URL ('www.reddit.com/hello' rather than '/hello')

            if(theLink.startsWith("http") && !data.oldLink(theLink)) {
                data.addURL(theLink);
                data.addVisitedURL(theLink); // Register each unique URL to ensure it isn't stored in 'url_to_visit' again
                System.out.println(theLink);
            }               
        }   
    }
}

DataStruc class

public class DataStruc {

    /** Queue to store URL's, can be accessed by multiple threads **/
    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();

    /** ArrayList of visited URL's **/
    private ArrayList<String> visited_url = new ArrayList<String>();

    public DataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    // Method to add seed URL to queue
    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Get URL at front of queue
    public String getURL() {
        return url_to_visit.poll();
    }

    // URL to visit size
    public int url_to_visit_size() {
        return url_to_visit.size();
    }

    // Add visited URL
    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    // Checks if link has already been visited
    public boolean oldLink(String link) {
        for(String s : visited_url) {
            if(s.equals(link)) {
                return true;
            }
        }   
        return false;
    }       
}

DataStruc is the shared data structure class, which will be concurrently accessed by each instance of a Crawler thread. DataStruc has a queue to store links to be visited, and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, as I understand it takes care of concurrent access. I didn't think I needed concurrent access for my ArrayList of visited URLs, as all I need to do is add to it and iterate over it to check for matches.

My problem is that when I compare the running time of a single thread versus 2 threads (on the same URL), the single-threaded version seems to run faster. I feel I have implemented the threading incorrectly, and would appreciate some tips if anybody can pinpoint the issues.

Thanks!

Answer

Added: see my comment, I think the check in Crawler

// While URL's left to visit
        while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {

is wrong. The 2nd Thread will stop immediately since the 1st Thread polled the only URL.
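
One way to avoid that premature exit (a minimal sketch, not from the original answer; MAX_IDLE_RETRIES is an illustrative constant) is to let a worker retry the poll briefly instead of quitting the moment the queue looks empty, since another thread may still be about to add links. ConcurrentLinkedQueue.poll() already returns null on an empty queue, so getURL() can serve as that check:

public void run() {

    int counter = 0;                   // pages successfully fetched by this thread
    int idleRetries = 0;               // consecutive empty polls
    final int MAX_IDLE_RETRIES = 20;   // illustrative constant, not from the original code

    while (counter < INDEX_LIMIT && idleRetries < MAX_IDLE_RETRIES) {

        String currentUrl = data.getURL();   // null if the queue is currently empty

        if (currentUrl == null) {
            // The queue is empty right now, but another thread may still be
            // fetching a page and about to enqueue more links - wait briefly.
            idleRetries++;
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            continue;
        }

        idleRetries = 0;

        // ... fetch the page, enqueue new links and handle IOException as before ...
        counter++;
    }
}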

You can ignore the remaining, but left for history ...

My general approach to such types of "big blocks that can run in parallel" is:

  1. Make each crawler a Callable. Probably Callable<List<String>>
  2. Submit them to an ExecutorService
  3. When they complete, take the results one at a time and add them to a List (a minimal sketch follows below).
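
A minimal sketch of that pattern (class and method names such as CrawlTask, as well as the placeholder URLs, are illustrative and not taken from the original code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParallelCrawl {

    // Illustrative task: crawl one URL and return the links found on it
    static class CrawlTask implements Callable<List<String>> {
        private final String url;

        CrawlTask(String url) {
            this.url = url;
        }

        @Override
        public List<String> call() throws Exception {
            // Fetch the page and collect its absolute links, as in Crawler.run()
            List<String> found = new ArrayList<String>();
            Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
            for (Element e : doc.select("a[href]")) {
                String link = e.attr("abs:href");
                if (link.startsWith("http")) {
                    found.add(link);
                }
            }
            return found;
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URLs - in practice these would come from an earlier crawl
        List<String> urlsToVisit = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b");

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();

        // Submit one task per URL
        for (String url : urlsToVisit) {
            futures.add(pool.submit(new CrawlTask(url)));
        }

        // When they complete, take the results one at a time and merge them
        List<String> allLinks = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            allLinks.addAll(f.get());
        }
        pool.shutdown();

        System.out.println("Collected " + allLinks.size() + " links");
    }
}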

Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running. And, if what they return is huge, you may need to worry about memory.

Would this suit your needs? You would have to worry about the addVisitedURL so you still need that as a concurrent data structure.
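
For example (a sketch, not part of the original answer), the visited set could be backed by ConcurrentHashMap.newKeySet(), where add() doubles as an atomic "already seen?" check:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class VisitedUrls {

    // Thread-safe set view backed by a ConcurrentHashMap (Java 8+)
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Returns true only for the first thread that registers this URL,
    // so "check and add" happens as a single atomic step
    public boolean markVisited(String url) {
        return visited.add(url);
    }
}

A markVisited() call like this would replace the separate oldLink() / addVisitedURL() pair, closing the race where two threads both decide the same link is new.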

Added: Since you are starting with a single URL this strategy doesn't apply. You could apply it after the visit to the first URL.

