Distributed Web crawling using Apache Spark

Views: 194 Published: 2016/5/19 23:27:22 Tags: apache web web-crawler apache-spark

Question

I was asked an interesting question in an interview about web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this but couldn't find any interesting answer. Is this possible with Spark?

Recommended answer

How about this approach:

Your application would take a set of website URLs as input for the crawler. If you were implementing it as a normal (non-Spark) application, you might do it as follows:

Split all the web pages to be crawled into a list of separate sites, each small enough to be handled well by a single thread. For example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split result could be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
Assign each base URL (for example www.example.com/news/20150401) to a single thread; the actual data fetching happens in these threads.
Save the result of each thread to the file system.
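The three steps above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the names `ThreadedCrawl`, `splitByDay` and `crawlAll` are hypothetical, and `fetch` and `save` are hooks the caller would supply (an HTTP client and a file writer, for instance), not part of the original answer.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ThreadedCrawl {
  // BASIC_ISO_DATE is the yyyyMMdd format used in the example URLs.
  private val dayFormat = DateTimeFormatter.BASIC_ISO_DATE

  // Step 1: expand a base URL into one URL per day, both ends inclusive.
  def splitByDay(baseUrl: String, from: String, to: String): Seq[String] = {
    val start = LocalDate.parse(from, dayFormat)
    val end = LocalDate.parse(to, dayFormat)
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(day => !day.isAfter(end))
      .map(day => s"$baseUrl/${day.format(dayFormat)}")
      .toList
  }

  // Steps 2 and 3: fetch every URL on a fixed thread pool and pass each
  // result to `save`. Both `fetch` and `save` are hypothetical caller-supplied
  // hooks; `save` must be safe to call from several threads.
  def crawlAll(urls: Seq[String], threads: Int)
              (fetch: String => String)
              (save: (String, String) => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val jobs = urls.map(url => Future(save(url, fetch(url))))
      // Block until every page has been fetched and saved.
      Await.result(Future.sequence(jobs), Duration.Inf)
    } finally pool.shutdown()
  }
}
```

A thread pool is used here rather than one thread per URL so that the number of concurrent requests stays bounded regardless of how many splits there are.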

When the application becomes a Spark one, the same procedure applies but is expressed in Spark terms: we can write a custom CrawlRDD to do the same work:

Split sites: def getPartitions: Array[Partition] is a good place to do the splitting.
Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] will be distributed to all the executors of your application and run in parallel.
Save the RDD to HDFS.

The final program looks like this:

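import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// X is a placeholder for the crawled record type (for example, String).

// One partition per base URL; the Partition trait requires an index.
class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  // Arrays are invariant in Scala, so the return type must be Array[Partition].
  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      var nextURL: String = null
      override def hasNext: Boolean = {
        // logic to find the next URL: if there is one, store it in nextURL
        // and return true, otherwise return false
        ???
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as X
        ???
      }
    }
  }
}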

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}
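The iterator logic that the comments in compute leave open could be filled in along the following lines. This is a Spark-free sketch under assumptions, not the answer's own implementation: `Page`, `CrawlIterator` and the `fetch` hook are hypothetical names, `fetch` stands in for whatever HTTP client and link extractor you use, and the record type X is assumed to be `Page`.

```scala
import scala.collection.mutable

// A fetched page: its body plus the outgoing links found in it (hypothetical type).
final case class Page(url: String, body: String, links: Seq[String])

// Breadth-first crawl of everything reachable under one base URL.
// Pages are fetched lazily, one per call to next().
class CrawlIterator(baseUrl: String, fetch: String => Page) extends Iterator[Page] {
  private val frontier = mutable.Queue(baseUrl) // URLs waiting to be fetched
  private val seen = mutable.Set(baseUrl)       // URLs already queued or fetched

  override def hasNext: Boolean = frontier.nonEmpty

  override def next(): Page = {
    val page = fetch(frontier.dequeue())
    // Follow only links that stay under the base URL and were not seen before;
    // seen.add returns true only the first time a URL is added.
    frontier ++= page.links.filter(link => link.startsWith(baseUrl) && seen.add(link))
    page
  }
}
```

Inside compute, one would then return something like `new CrawlIterator(p.baseURL, fetch)`. Restricting the crawl to links under the base URL keeps each partition's work bounded to its own split.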
