Sampling a large distributed data set using pyspark / spark
Problem description
I have a file in hdfs which is distributed across the nodes in the cluster.
I'm trying to get a random sample of 10 lines from this file.
In the pyspark shell, I read the file into an RDD using:
>>> textFile = sc.textFile("/user/data/myfiles/*")
and then I want to simply take a sample. The cool thing about Spark is that there are commands like takeSample; unfortunately, I think I'm doing something wrong, because the following takes a really long time:
>>> textFile.takeSample(False, 10, 12345)
So I tried creating a partition on each node and then instructing each node to sample its partition, using the following command:
>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()
but this gives the error ValueError: too many values to unpack:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
for obj in iterator:
File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
for (k, v) in iterator:
ValueError: too many values to unpack
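The traceback points at partitionBy, which only works on RDDs of (key, value) pairs; calling it on an RDD of plain strings makes Spark try to unpack each line as a pair, hence the ValueError. A second problem is that mapPartitions hands each task a plain Python iterator, not an RDD, so RDD methods like takeSample are not available inside it. A minimal pure-Python sketch of a per-partition sampler that a mapPartitions call could use (the function name and its parameters are illustrative, not from the original post):

```python
import random

def sample_partition(lines_iter, k=10, seed=1234):
    """Return up to k random lines from one partition's iterator."""
    rng = random.Random(seed)
    # Materialize the partition; this assumes a single partition fits in memory.
    lines = list(lines_iter)
    return rng.sample(lines, min(k, len(lines)))

# Hypothetical pyspark usage (requires a SparkContext, so not runnable here):
#   per_partition = textFile.mapPartitions(sample_partition)
#   ten_lines = per_partition.takeSample(False, 10, 1234)
```

Note this yields up to k lines *per partition*, so a final takeSample (or take) over the much smaller result is still needed to get exactly 10 lines overall.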
How can I sample 10 lines from a large distributed data set using spark or pyspark?
Solution
Using sample instead of takeSample appears to make things reasonably fast:
textFile.sample(False, .0001, 12345)
The problem with this approach is that it's hard to know the right fraction to choose unless you have a rough idea of the number of rows in your data set.
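One way around not knowing the fraction is to pay for a single count() up front and derive the fraction from it, oversampling slightly so that sample() rarely returns fewer rows than requested. A hedged sketch (fraction_for is an illustrative helper, not a Spark API):

```python
def fraction_for(n_desired, n_total, oversample=1.5):
    """Fraction to pass to RDD.sample() so it returns roughly
    oversample * n_desired rows out of n_total."""
    if n_total <= 0:
        return 0.0
    # Cap at 1.0: sample() fractions above 1 are meaningless without replacement.
    return min(1.0, oversample * n_desired / n_total)

# Hypothetical pyspark usage:
#   n = textFile.count()
#   ten_lines = textFile.sample(False, fraction_for(10, n), 12345).take(10)
```

The trailing take(10) trims the oversampled result down to exactly the 10 lines asked for.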