Spark and Python trying to parse wikipedia using gensim


Question

href=\"http://stackoverflow.com/questions/26177307/spark-and-python-use-custom-file-format-generator-as-input-for-rdd?noredirect=1#comment41082418_26177307\">Spark和Python使用自定义的文件格式/发电机作为RDD输入我想我应该可以通过sc.textFile()基本上解析任何输入,然后用我的还是从某些库自定义函数。

Based on my previous question Spark and Python use custom file format/generator as input for RDD I think that I should be able to parse basically any input by sc.textFile() and then using my or from some library custom functions.

Now I am particularly trying to parse the wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and now I would like to use gensim's built-in function for parsing wikipedia pages, inspired by this question: List (or iterator) of tuples returned by MAP (PySpark) (http://stackoverflow.com/questions/21096432/list-or-iterator-of-tuples-returned-by-map-pyspark).

My code is the following:

import sys
import gensim
from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)

    sc = SparkContext(appName="Process wiki - distributed RDD")

    distData = sc.textFile(sys.argv[1])
    # take only 10 elements to see what the output would look like
    processed_data = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)

    print processed_data
    sc.stop()

The source code of extract_pages can be found at https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py and, based on my going through it, it seems that it should work with Spark.

But unfortunately, when I run the code I get the following error log:

14/10/05 13:21:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <ip address>.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 1148, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib64/python2.6/site-packages/gensim/corpora/wikicorpus.py", line 190, in extract_pages
    elems = (elem for _, elem in iterparse(f, events=("end",)))
  File "<string>", line 52, in __init__
IOError: [Errno 2] No such file or directory: u'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/ http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">'
    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

And then what is probably a Spark log:

14/10/05 13:21:12 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/10/05 13:21:12 INFO scheduler.DAGScheduler: Failed to run runJob at PythonRDD.scala:296

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I've tried this without Spark successfully, so the problem should lie somewhere in the combination of Spark and gensim, but I don't really understand the error I'm getting. I don't see any file reading on line 190 of gensim's wikicorpus.py.

EDIT:

Some more log from Spark:

EDIT2:

gensim uses from xml.etree.cElementTree import iterparse (see its documentation), which might be what causes the problem. It actually expects a file name or a file object containing the xml data. Can an RDD be considered a file containing the xml data?
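A minimal sketch outside Spark (standard library only, Python 2) seems to confirm this suspicion: each element of the RDD produced by sc.textFile() is a plain string, and iterparse treats a plain string as a file name, which looks exactly like the IOError in the traceback above.

from xml.etree.cElementTree import iterparse
from StringIO import StringIO

# one physical line of the dump, as sc.textFile() would hand it to flatMap
line = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/">'

try:
    list(iterparse(line))  # the string is interpreted as a file name
except IOError as e:
    print "fails like in the Spark worker:", e

# Wrapping the line in a file-like object satisfies iterparse, but a single
# line of the dump is still not a well-formed XML document, so parsing a
# fragment like this fails as well.
try:
    list(iterparse(StringIO(line)))
except Exception as e:
    print "still not a complete document:", e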

Answer

I usually work with Spark in Scala. Nevertheless, here are my thoughts:

When you load a file via sc.textFile, it is some sort of line iterator which is distributed across your Spark workers. I think that, given the xml format of the wikipedia dump, one line does not necessarily correspond to a parseable xml item, and thus you are getting this problem.

i.e.:

 Line 1 :  <item>
 Line 2 :  <title> blabla </title> <subitem>
 Line 3 : </subItem>
 Line 4 : </item>

If you try to parse each line on its own, it will spit out exceptions like the ones you got.
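A quick way to see this for yourself, reusing the distData RDD from your code, is to print a few raw elements; each one is a single physical line of the XML, not a whole page element:

# each element handed to flatMap is one line of the dump file, so
# extract_pages never receives anything it can open or parse as a document
for line in distData.take(5):
    print repr(line)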

I usually have to mess around with a wikipedia dump, so the first thing I do is to transform it into a "readable version" which is easily digested by Spark, i.e. one line per article entry. Once you have it like that, you can easily feed it into Spark and do all kinds of processing. It doesn't take many resources to transform it.
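For example, once you have such a one-article-per-line file, something along these lines should work (the input path here is just a placeholder, and I'm using gensim's filter_wiki and simple_preprocess helpers rather than the XML-level extract_pages):

import gensim
from pyspark import SparkContext

def tokenize_article(line):
    # 'line' is assumed to hold the complete wiki markup of one article
    text = gensim.corpora.wikicorpus.filter_wiki(line)
    return gensim.utils.simple_preprocess(text)

sc = SparkContext(appName="Process flattened wiki dump")
# placeholder path to the pre-flattened dump (one article per line)
articles = sc.textFile("wiki_one_article_per_line.txt")
tokens = articles.map(tokenize_article)

print tokens.take(2)
sc.stop()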

Take a look at ReadableWiki: https://github.com/idio/wiki2vec
