Distributed Web crawling using Apache Spark
Question
I was asked an interesting question in an interview about web mining: is it possible to crawl websites using Apache Spark?
I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for an answer but couldn't find anything interesting. Is this possible with Spark?
Recommended answer
How about this approach:
Your application receives a set of website URLs as input for the crawler. If you were implementing it as a normal (non-Spark) application, you might do it as follows:
- Split all the web pages to be crawled into a list of separate sites, each small enough to be handled well by a single thread. For example, if you have to crawl www.example.com/news from 20150301 to 20150401, the split result can be: [www.example.com/news/20150301, www.example.com/news/20150302, ..., www.example.com/news/20150401]
- Assign each base URL (e.g. www.example.com/news/20150401) to a single thread; the threads are where the actual data fetching happens.
- Save the result of each thread to the file system.
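The three steps above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the names ThreadedCrawl, dailyUrls and crawlAll are hypothetical, and the fetch function (which would download a page and, if desired, write it to the file system) is left to the caller.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical helper object, not part of the original answer.
object ThreadedCrawl {
  // yyyyMMdd, e.g. 20150301
  private val fmt = DateTimeFormatter.BASIC_ISO_DATE

  /** Step 1: one URL per day in [from, to], both ends inclusive. */
  def dailyUrls(base: String, from: LocalDate, to: LocalDate): Seq[String] =
    Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(!_.isAfter(to))
      .map(d => s"$base/${d.format(fmt)}")
      .toList

  /** Steps 2 and 3: run `fetch` for every URL on a fixed-size thread pool.
    * `fetch` is supplied by the caller; it is where the download (and any
    * saving to disk) would happen. Results come back in input order.
    */
  def crawlAll[A](urls: Seq[String], threads: Int)(fetch: String => A): Seq[(String, A)] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val jobs = urls.map(u => Future(u -> fetch(u)))
      // The 10-minute timeout is an arbitrary choice for this sketch.
      Await.result(Future.sequence(jobs), 10.minutes)
    } finally pool.shutdown()
  }
}
```

For the example in the list, ThreadedCrawl.dailyUrls("www.example.com/news", LocalDate.of(2015, 3, 1), LocalDate.of(2015, 4, 1)) produces one URL per day, and crawlAll can then be called with whatever download function the application uses.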
When the application becomes a Spark application, the same procedure happens, but encapsulated in Spark concepts: we can write a custom CrawlRDD to do the same work:
- Split sites: def getPartitions: Array[Partition] is a good place to do the splitting.
- Threads to crawl each split: def compute(part: Partition, context: TaskContext): Iterator[X] will be spread across all the executors of your application and run in parallel.
- Save the RDD to HDFS.
The final program looks like this:
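import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// X is a placeholder for whatever type holds the crawled content (for example String).

class CrawlPartition(rddId: Int, idx: Int, val baseURL: String) extends Partition {
  // Partition requires an index; without it the class does not compile
  override def index: Int = idx
}

class CrawlRDD(baseURL: String, sc: SparkContext) extends RDD[X](sc, Nil) {

  // The return type must be Array[Partition] to override RDD.getPartitions
  override protected def getPartitions: Array[Partition] = {
    val partitions = new ArrayBuffer[Partition]
    // split baseURL into subsets and populate the partitions
    partitions.toArray
  }

  override def compute(part: Partition, context: TaskContext): Iterator[X] = {
    val p = part.asInstanceOf[CrawlPartition]
    val baseUrl = p.baseURL

    new Iterator[X] {
      // an uninitialized var needs an explicit type
      var nextURL: String = _

      override def hasNext: Boolean = {
        // logic to find the next URL: if there is one, store it in nextURL
        // and return true; otherwise return false
        ???
      }

      override def next(): X = {
        // logic to crawl the web page at nextURL and return its content as an X
        ???
      }
    }
  }
}

object Crawl {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Crawler")
    val sc = new SparkContext(sparkConf)
    val crdd = new CrawlRDD("baseURL", sc)
    crdd.saveAsTextFile("hdfs://path_here")
    sc.stop()
  }
}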
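The answer leaves the bodies of hasNext and next as comments. One possible way to fill them in, shown here without Spark so it can be tried on its own, is a breadth-first frontier that stays under the partition's base URL. Everything in this sketch is an assumption rather than the answerer's design: the name LazyCrawl, the prefix check, and the caller-supplied fetch function that returns a page's content together with its outgoing links.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of the original answer.
object LazyCrawl {
  /** Lazily crawls pages reachable from baseUrl, one page per call to next().
    * `fetch` is supplied by the caller and returns (content, outgoing links).
    * Only links that start with baseUrl are followed (an assumption of this sketch).
    */
  def crawl[A](baseUrl: String)(fetch: String => (A, Seq[String])): Iterator[A] =
    new Iterator[A] {
      private val frontier = mutable.Queue(baseUrl) // URLs waiting to be crawled
      private val seen = mutable.Set(baseUrl)       // URLs already queued once

      // There is a next page as long as the frontier is not empty.
      override def hasNext: Boolean = frontier.nonEmpty

      override def next(): A = {
        val url = frontier.dequeue()
        val (content, links) = fetch(url)
        // Queue each in-scope link the first time it is seen.
        for (l <- links if l.startsWith(baseUrl) && seen.add(l)) frontier.enqueue(l)
        content
      }
    }
}
```

Inside compute, an iterator of this shape could be returned directly, with fetch wrapping whatever HTTP client and link extractor the crawler uses. Because pages are fetched only when next() is called, a partition's pages never all have to be held in memory at once.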