Spark and Python trying to parse wikipedia using gensim


Question

href=\"http://stackoverflow.com/questions/26177307/spark-and-python-use-custom-file-format-generator-as-input-for-rdd?noredirect=1#comment41082418_26177307\">Spark和Python使用自定义的文件格式/发电机作为RDD输入我想我应该可以通过sc.textFile()基本上解析任何输入,然后用我的还是从某些库自定义函数。

Based on my previous question Spark and Python use custom file format/generator as input for RDD I think that I should be able to parse basically any input by sc.textFile() and then using my or from some library custom functions.

Now I am particularly trying to parse the wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and now I would like to use gensim's built-in function for parsing wikipedia pages, inspired by this question: List (or iterator) of tuples returned by MAP (PySpark) (http://stackoverflow.com/questions/21096432/list-or-iterator-of-tuples-returned-by-map-pyspark).

My code is the following:

import sys
import gensim
from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)

    sc = SparkContext(appName="Process wiki - distributed RDD")

    distData = sc.textFile(sys.argv[1])
    # take only 10 elements to see what the output would look like
    processed_data = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)

    print processed_data
    sc.stop()

The source code of extract_pages can be found at https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py and, based on my going through it, it seems that it should work with Spark.

But unfortunately, when I run the code I get the following error log:

14/10/05 13:21:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <ip address>.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 1148, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib64/python2.6/site-packages/gensim/corpora/wikicorpus.py", line 190, in extract_pages
    elems = (elem for _, elem in iterparse(f, events=("end",)))
  File "<string>", line 52, in __init__
IOError: [Errno 2] No such file or directory: u'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/ http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">'
    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

And then what is probably a Spark log:

14/10/05 13:21:12 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/10/05 13:21:12 INFO scheduler.DAGScheduler: Failed to run runJob at PythonRDD.scala:296

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I've tried this without Spark successfully, so the problem should lie somewhere in the combination of Spark and gensim, but I don't really understand the error I'm getting. I don't see any file reading on line 190 of gensim's wikicorpus.py.

EDIT:

Some more log from Spark:

EDIT2:

gensim uses from xml.etree.cElementTree import iterparse (see its documentation), which might be what causes the problem. It actually expects a file name or a file object containing the xml data. Can an RDD be considered a file containing the xml data?
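A minimal sketch outside Spark (standard library only, Python 2) seems to confirm this suspicion: each element of the RDD produced by sc.textFile() is a plain string, and iterparse treats a plain string as a file name, which looks exactly like the IOError in the traceback above.

from xml.etree.cElementTree import iterparse
from StringIO import StringIO

# one physical line of the dump, as sc.textFile() would hand it to flatMap
line = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/">'

try:
    list(iterparse(line))  # the string is interpreted as a file name
except IOError as e:
    print "fails like in the Spark worker:", e

# Wrapping the line in a file-like object satisfies iterparse, but a single
# line of the dump is still not a well-formed XML document, so parsing a
# fragment like this fails as well.
try:
    list(iterparse(StringIO(line)))
except Exception as e:
    print "still not a complete document:", e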

Answer

I usually work with Spark in Scala. Nevertheless, here are my thoughts:

When you load a file via sc.textFile, it is some sort of line iterator which is distributed across your Spark workers. I think that, given the xml format of the wikipedia dump, one line does not necessarily correspond to a parseable xml item, and thus you are getting this problem.

i.e.:

 Line 1 :  <item>
 Line 2 :  <title> blabla </title> <subitem>
 Line 3 : </subItem>
 Line 4 : </item>

If you try to parse each line on its own, it will spit out exceptions like the ones you got.
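A quick way to see this for yourself, reusing the distData RDD from your code, is to print a few raw elements; each one is a single physical line of the XML, not a whole page element:

# each element handed to flatMap is one line of the dump file, so
# extract_pages never receives anything it can open or parse as a document
for line in distData.take(5):
    print repr(line)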

I usually have to mess around with a wikipedia dump, so the first thing I do is to transform it into a "readable version" which is easily digested by Spark, i.e. one line per article entry. Once you have it like that, you can easily feed it into Spark and do all kinds of processing. It doesn't take many resources to transform it.
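For example, once you have such a one-article-per-line file, something along these lines should work (the input path here is just a placeholder, and I'm using gensim's filter_wiki and simple_preprocess helpers rather than the XML-level extract_pages):

import gensim
from pyspark import SparkContext

def tokenize_article(line):
    # 'line' is assumed to hold the complete wiki markup of one article
    text = gensim.corpora.wikicorpus.filter_wiki(line)
    return gensim.utils.simple_preprocess(text)

sc = SparkContext(appName="Process flattened wiki dump")
# placeholder path to the pre-flattened dump (one article per line)
articles = sc.textFile("wiki_one_article_per_line.txt")
tokens = articles.map(tokenize_article)

print tokens.take(2)
sc.stop()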

Take a look at ReadableWiki: https://github.com/idio/wiki2vec
