Sampling a large distributed data set using pyspark / spark


Problem description



I have a file in HDFS which is distributed across the nodes in the cluster.

I'm trying to get a random sample of 10 lines from this file.

In the pyspark shell, I read the file into an RDD using:

>>> textFile = sc.textFile("/user/data/myfiles/*")

and then I want to simply take a sample... The cool thing about Spark is that there are commands like takeSample; unfortunately, I think I'm doing something wrong, because the following takes a really long time:

>>> textFile.takeSample(False, 10, 12345)

So I tried creating a partition on each node, and then instructing each node to sample that partition using the following command:

>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()

but this gives an error, ValueError: too many values to unpack:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
    for (k, v) in iterator:
ValueError: too many values to unpack

How can I sample 10 lines from a large distributed data set using spark or pyspark?

Solution

Using sample instead of takeSample appears to make things reasonably fast:

textFile.sample(False, .0001, 12345)

The problem with this is that it's hard to know the right fraction to choose unless you have a rough idea of the number of rows in your data set.
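
If a single pass over the data is acceptable, you can derive the fraction from the row count and the target sample size. A minimal sketch (not part of the original answer; count, sample and take are standard RDD methods, and the 2x oversampling factor is just an assumption to reduce the chance of coming up short):

>>> n = textFile.count()                  # one full pass over the data to count the rows
>>> fraction = min(1.0, 20.0 / n)         # oversample roughly 2x the 10 lines we want
>>> textFile.sample(False, fraction, 12345).take(10)   # trim the random draw back to 10 lines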

