Distributed Web crawling using Apache Spark - Is it Possible?


Problem description

An interesting question was asked of me when I attended an interview about web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this, but couldn't find any interesting answer. Is that possible with Spark?

Recommended answer

How about this way:

Your application would get a set of website URLs as input for your crawler. If you were implementing just a normal (non-Spark) app, you might do it as follows:

  1. Split all the web pages to be crawled into a list of separate sites, each small enough to fit a single thread well. For example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
  2. Assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the threads are where the actual data fetching happens.
  3. Save the result of each thread to the file system (a sketch of these steps in plain Scala follows this list).
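
A minimal sketch of these three steps in plain Scala (without Spark) could look like the following; the date range, the fetch helper, and the output file naming are assumptions made purely for illustration:

import java.nio.file.{Files, Paths}
import java.time.LocalDate

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object PlainCrawler {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Assumed fetch helper: download one day's listing page and return its content.
  def fetch(url: String): String = scala.io.Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    // Step 1: split the date range into per-day base URLs.
    val days = Iterator.iterate(LocalDate.of(2015, 3, 1))(_.plusDays(1))
      .takeWhile(!_.isAfter(LocalDate.of(2015, 4, 1)))
      .map(d => s"www.example.com/news/${d.toString.replace("-", "")}")
      .toList

    // Step 2: fetch each base URL in its own task (a thread from the global pool).
    val results = Future.traverse(days)(url => Future((url, fetch("http://" + url))))

    // Step 3: save each thread's result to the local file system.
    Await.result(results, Duration.Inf).foreach { case (url, content) =>
      Files.write(Paths.get(url.replace('/', '_') + ".html"), content.getBytes("UTF-8"))
    }
  }
}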

When the application becomes a Spark application, the same procedure happens, but encapsulated in Spark concepts: we can write a custom CrawlRDD to do the same job:

  1. Split the sites: def getPartitions: Array[Partition] is a good place to do the split work.
  2. Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] is spread across all the executors of your application and runs in parallel.
  3. Save the resulting RDD into HDFS.

The final program would look like this:

import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuffer

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  // Partition requires an index; without this override the class does not compile.
  override def index: Int = idx
}

// X is a placeholder for the type of the crawled content (e.g. String).
class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions,
    // one CrawlPartition per subset
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next url; if there is one, fill in nextURL and return true,
        // otherwise return false
      }

      override def next(): X = {
        // logic to crawl the web page nextURL and return its content as X
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
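
For comparison, the same idea can also be sketched without a custom RDD, by parallelizing the pre-split base URLs and fetching inside mapPartitions so that each partition is crawled by one task on an executor. The fetchPage helper and the example URLs below are hypothetical placeholders, not part of the original answer:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleCrawl {
  // Hypothetical helper: download one page and return its raw content.
  def fetchPage(url: String): String = scala.io.Source.fromURL(url).mkString

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleCrawler"))

    // Step 1: the pre-split base URLs become the elements of the RDD.
    val baseUrls = Seq(
      "http://www.example.com/news/20150301",
      "http://www.example.com/news/20150302"
      // ...
    )

    // Step 2: each partition is crawled by one task on an executor.
    // Step 3: the results are written to HDFS.
    sc.parallelize(baseUrls, numSlices = baseUrls.size)
      .mapPartitions(urls => urls.map(url => url + "\t" + fetchPage(url)))
      .saveAsTextFile("hdfs://path_here")

    sc.stop()
  }
}

The custom CrawlRDD mainly gives you explicit control over how the URL set is partitioned; at runtime both versions behave the same way: every task fetches its own slice of URLs in parallel and the results are written out to HDFS.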
