Sampling a large distributed data set using pyspark / spark


Problem description



I have a file in HDFS which is distributed across the nodes in the cluster.

I'm trying to get a random sample of 10 lines from this file.

In the pyspark shell, I read the file into an RDD using:

>>> textFile = sc.textFile("/user/data/myfiles/*")

and then I want to simply take a sample... The cool thing about Spark is that there are commands like takeSample; unfortunately, I think I'm doing something wrong, because the following takes a really long time:

>>> textFile.takeSample(False, 10, 12345)

So I tried creating a partition on each node, and then instructing each node to sample that partition using the following command:

>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()

but this gives an error, ValueError: too many values to unpack:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
    for (k, v) in iterator:
ValueError: too many values to unpack

How can I sample 10 lines from a large distributed data set using spark or pyspark?

Solution

Using sample instead of takeSample appears to make things reasonably fast:

textFile.sample(False, .0001, 12345)

The problem with this is that it's hard to know the right fraction to choose unless you have a rough idea of the number of rows in your data set.
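
If a single pass over the data is acceptable, you can derive the fraction from the row count and the target sample size. A minimal sketch (not part of the original answer; count, sample and take are standard RDD methods, and the 2x oversampling factor is just an assumption to reduce the chance of coming up short):

>>> n = textFile.count()                  # one full pass over the data to count the rows
>>> fraction = min(1.0, 20.0 / n)         # oversample roughly 2x the 10 lines we want
>>> textFile.sample(False, fraction, 12345).take(10)   # trim the random draw back to 10 lines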

