how to specify the partition for mapPartition in spark

Question

What I would like to do is compute each list separately. So, for example, if I have 5 lists ([1,2,3,4,5,6],[2,3,4,5,6],[3,4,5,6],[4,5,6],[5,6]) and I would like to get the 5 lists without the 6, I would do something like:

# sc is an existing SparkContext (e.g. from the pyspark shell)
data = ([1,2,3,4,5,6] + [2,3,4,5,6,7] + [3,4,5,6,7,8]
        + [4,5,6,7,8,9] + [5,6,7,8,9,10])  # one flat list of 30 numbers

def function_1(iter_listoflist):
    final_iterator = []
    for sublist in iter_listoflist:
        final_iterator.append([x for x in sublist if x != 6])
    return iter(final_iterator)

# 5 partitions of 6 elements each; glom() turns each partition into a list
sc.parallelize(data, 5).glom().mapPartitions(function_1).collect()

Then I would cut the lists up so that I get the original lists back. Is there a way to simply keep the computation separate for each list? I don't want the lists to mix, and they might be of different sizes.

Thank you

Philippe

Answer

As far as I understand your intentions, all you need here is to keep the individual lists separate when you parallelize your data:

data = [[1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8],
    [4,5,6,7,8,9], [5,6,7,8,9,10]]

rdd = sc.parallelize(data)

rdd.take(1) # A single element of an RDD is a whole list
## [[1, 2, 3, 4, 5, 6]]
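
Since the title asks how to specify the partitioning: parallelize also accepts a numSlices argument, so you can put each list in its own partition if you want strictly per-partition processing. A minimal sketch, assuming the same data as above (the map below works either way):

rdd = sc.parallelize(data, len(data))  # one partition per list; purely illustrative

rdd.getNumPartitions()
## 5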

Now you can simply map using a function of your choice:

def drop_six(xs):
    return [x for x in xs if x != 6]

rdd.map(drop_six).take(3)
## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
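
And if you specifically want the mapPartitions form from the question, an equivalent sketch (drop_six_partition is a hypothetical helper name) iterates over the whole lists that each partition yields and filters them one by one:

def drop_six_partition(iterator):
    # each element produced by the partition iterator is a whole list
    for xs in iterator:
        yield [x for x in xs if x != 6]

rdd.mapPartitions(drop_six_partition).take(3)
## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]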
