Improving performance of crawler4j

Problem description

I need to write a web scraper that scrapes around 1M websites and saves their title, description and keywords into one big file (containing the scraped URL and the related words). The URLs should be extracted from a big file.

I've run Crawler4j on the 1M-URL file and started the web crawler using this: controller.start(MyCrawler.class, 20). 20 is an arbitrary number. Each crawler passes the resulting words into a blocking queue for a single thread to write these words and the URL to the file. I've used 1 writer thread in order not to synchronize on the file. I set the crawl depth to 0 (I only need to crawl my seed list).
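
The FileWriterThread referenced here is not shown in the question. A minimal sketch of what such a writer thread could look like, assuming it simply drains the queue and appends each record to the output file (the class name matches the constructor call in the code further down, but the body is an assumption, not the asker's actual code):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;

// Hypothetical writer thread: drains the queue filled by the crawler threads and
// appends each record to the output file, so only one thread ever touches the file.
public class FileWriterThread implements Runnable {
    private final BlockingQueue<String> queue;
    private final Path outputFile;

    public FileWriterThread(BlockingQueue<String> queue, String folder, String fileName) {
        this.queue = queue;
        this.outputFile = Paths.get(folder, fileName);
    }

    @Override
    public void run() {
        try (BufferedWriter writer = Files.newBufferedWriter(outputFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            while (!Thread.currentThread().isInterrupted()) {
                String record = queue.take();   // blocks until a crawler offers a record
                writer.write(record);
                writer.flush();                 // flush per record to limit loss on a crash
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop cleanly when the executor is shut down
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}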

After running this overnight I've only downloaded around 200K URLs. I'm running the scraper on 1 machine with a wired connection. Since most of the URLs are on different hosts, I don't think the politeness parameter has any importance here.

EDIT

I tried starting Crawler4j using the non-blocking start, but it just got blocked. My Crawler4j version is 4.2. This is the code I'm using:

// crawl configuration: depth 0 (seeds only), no binary content, small politeness delay
CrawlConfig config = new CrawlConfig();
List<Header> headers = Arrays.asList(
        new BasicHeader("Accept", "text/html,text/xml"),
        new BasicHeader("Accept-Language", "en-gb, en-us, en-uk")
);
config.setDefaultHeaders(headers);
config.setCrawlStorageFolder(crawlStorageFolder);
config.setMaxDepthOfCrawling(0);
config.setUserAgentString("testcrawl");
config.setIncludeBinaryContentInCrawling(false);
config.setPolitenessDelay(10);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

// crawler threads hand their results to this queue; a single writer thread drains it,
// so the output file never needs synchronization
BlockingQueue<String> urlsQueue = new ArrayBlockingQueue<>(400);
controller = new CrawlController(config, pageFetcher, robotstxtServer);

ExecutorService executorService = Executors.newSingleThreadExecutor();
Runnable writerThread = new FileWriterThread(urlsQueue, crawlStorageFolder, outputFile);

executorService.execute(writerThread);

// start 4 crawler threads in non-blocking mode, then feed the seed URLs from the file
controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

File file = new File(urlsFileName);
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String url;
    while ((url = br.readLine()) != null) {
        controller.addSeed(url);
    }
}

EDIT 1 - This is the code for MyCrawler

public class MyCrawler extends WebCrawler {
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp3|zip|gz))$");
    public static final String DELIMETER = "||||";
    private final StringBuilder buffer = new StringBuilder();
    private final BlockingQueue<String> urlsQueue;

    public MyCrawler(BlockingQueue<String> urlsQueue) {
        this.urlsQueue = urlsQueue;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            String html = parseData.getHtml();
            String title = parseData.getTitle();

            Document document = Jsoup.parse(html);
            buffer.append(url.replaceAll("[\n\r]", "")).append(DELIMETER).append(title);
            Elements descriptions = document.select("meta[name=description]");
            for (Element description : descriptions) {
                if (description.hasAttr("content"))
                    buffer.append(description.attr("content").replaceAll("[\n\r]", ""));
            }

            Elements elements = document.select("meta[name=keywords]");
            for (Element element : elements) {
                String keywords = element.attr("content").replaceAll("[\n\r]", "");
                buffer.append(keywords);
            }
            buffer.append("\n");
            String urlContent = buffer.toString();
            buffer.setLength(0);
            urlsQueue.add(urlContent);
        }
    }

    private boolean isSuccessful(int statusCode) {
        return 200 <= statusCode && statusCode < 400;
    }
}

So I have 2 questions:

  1. Can someone suggest any other way to make this process take less time? Maybe somehow tuning the number of crawler threads? Maybe some other optimizations? I'd prefer a solution that doesn't require several machines, but if you think that's the only way to go, could someone suggest how to do that? Maybe a code example?
  2. Is there any way to make the crawler start working on some URLs and keep adding more URLs during the crawl? I've looked at crawler.startNonBlocking, but it doesn't seem to work very well.

Thanks in advance

Recommended answer

crawler4j is by default designed to run on one machine. From the field of web crawling, we know that a web crawler's performance depends primarily on the following four resources:

  • Disk
  • CPU
  • Bandwidth
  • (RAM)

Defining the optimal number of threads depends on your hardware setup. Thus, more machines will result in a higher throughput. The next hard limitation is network bandwidth. If you are not connected via a high-speed Internet connection, this will be the bottleneck of your approach.
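
As a rough illustration (the multiplier below is an assumption to be tuned, not a crawler4j recommendation), the thread count can be derived from the machine instead of being hard-coded, since this kind of crawl is mostly I/O-bound:

// rough heuristic: crawling is mostly network-bound, so more threads than CPU cores
// usually helps until bandwidth or disk becomes the bottleneck
int cores = Runtime.getRuntime().availableProcessors();
int numberOfCrawlers = cores * 8;   // assumed multiplier; tune it by measuring pages per second

controller.startNonBlocking(() -> new MyCrawler(urlsQueue), numberOfCrawlers);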

Moreover, crawler4j is not designed to load such a huge seed file by default. This is due to the fact that crawler4j respects crawler politeness. This implies that, before the crawl starts, every seed point is checked for a robots.txt, which can take quite a bit of time.
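
If respecting robots.txt is not a hard requirement for this particular crawl, the check can be switched off via RobotstxtConfig, which removes one extra HTTP round trip per seed host. A sketch, assuming the setEnabled flag offered by crawler4j's RobotstxtConfig; keep it enabled if politeness matters to you:

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);   // skip fetching and checking robots.txt for every host
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);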

Adding seeds after the crawl has started is possible and should work if the crawl is started in non-blocking mode. However, it can take a while until the URLs are processed.

For a multi-machine setup you can take a look at Apache Nutch. However, Nutch is a bit difficult to learn.

After reproducing your setup, I can answer your question about adding seed pages dynamically.

Starting the crawler this way

controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

will invoke the run() method of every crawler thread. Investigating this method, we find a call to frontier.getNextURLs(50, assignedURLs), which is responsible for taking unseen URLs from the frontier in order to process them. In this method, we find a so-called waitingList, which causes the thread to wait. Since notifyAll is never invoked on waitingList until the controller is shut down, the threads will never reschedule new URLs.

To overcome this issue, you have two possible solutions:

  1. Just add at least one URL per thread as a seed point. The deadlock situation will not occur. After starting the threads in non-blocking mode, you can add seeds as you like.

// add at least one seed before starting the crawler threads, so they do not dead-lock
controller.addSeed("https://www.google.de");

controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

// further seeds can be added while the crawl is running
controller.addSeed("https://www.google.de/test");

controller.waitUntilFinish();

  2. Go for a fork of the GitHub project and adapt the code of Frontier.java so that the waitingList.notifyAll() method can be invoked from the CrawlController after seed pages are dynamically added (a rough sketch follows below).
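
A rough sketch of what that fork-based change might look like (the method names and the exact locking are assumptions about crawler4j's internals, not code verified against the 4.2 sources):

// Hypothetical addition inside Frontier.java of a crawler4j fork:
public void wakeUpWaitingThreads() {
    synchronized (waitingList) {      // assumption: idle worker threads wait() on this monitor
        waitingList.notifyAll();      // let them re-check the frontier for newly added URLs
    }
}

// ...and a hypothetical hook in CrawlController, called after seeds are added dynamically:
public void addSeedAndNotify(String pageUrl) {
    addSeed(pageUrl);
    frontier.wakeUpWaitingThreads();
}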
